Abstract


In  this  thesis,  we  compare  the  computational  power  of  time  bounded  Parallel  Random 
Access  Machines  (PRAMs)  with  different  instruction  sets.  A  basic  PRAM  can  perform  the 
following  operations  in  unit-time:  addition,  subtraction,  Boolean  operations,  comparisons, 
and  indirect  addressing.  Multiple  processors  may  concurrently  read  and  concurrently  write  a 
single  cell.  Let  PRAM[op]  denote  the  class  of  PRAMs  with  the  basic  instruction  set 
augmented with the set op of instructions. Let ↑ and ↓ denote unrestricted left and right
shift,  respectively. 

We prove that polynomial time on a PRAM[*] or on a PRAM[*,÷] or on a PRAM[↑,↓]
is equivalent to polynomial space on a Turing machine (PSPACE). This extends the result
that polynomial time on a basic PRAM is equivalent to PSPACE (Fortune and Wyllie, 1978)
to hold when the PRAM is allowed unit-time multiplication or division or unrestricted shifts.
It also extends to the PRAM the results that polynomial time on a random access machine
(RAM) with multiplication is equivalent to PSPACE (Hartmanis and Simon, 1974) and that
polynomial time on a RAM with shifts (that is, a vector machine) is equivalent to PSPACE
(Pratt and Stockmeyer, 1976; Simon, 1977).

This thesis establishes that the class of languages accepted in polynomial time on a
PRAM[*,↑,↓] contains the class of languages accepted in exponential time on a
nondeterministic Turing machine (NEXPTIME) and is contained in the class of languages
accepted in exponential space on a Turing machine. This result is notable because if, as has
been conjectured, NEXPTIME properly contains PSPACE, then a PRAM[*,↑,↓] is more
powerful, to within a polynomial factor in time, than a PRAM with one of the other
instruction sets.

We present efficient simulations of PRAMs with enhanced instruction sets by
sequential  RAMs  with  the  same  instruction  sets.  This  thesis  presents  simulations  of 
probabilistic  PRAMs  by  deterministic  PRAMs,  using  parallelism  to  replace  randomness.  We 
also  give  simulations  of  PRAM[op]s  by  PRAMs,  where  both  the  simulated  machine  and  the 
simulating machine are exclusive read, exclusive write machines.

Chapter 1. Introduction


An  important  model  of  parallel  computation  is  the  Parallel  Random  Access  Machine 
(PRAM),  which  comprises  multiple  processors  that  execute  instructions  synchronously  and 
share  a  common  memory.  Formalized  by  Fortune  and  Wyllie  (1978)  and  Goldschlager 
(1982),  the  PRAM  is  a  much  more  natural  model  of  parallel  computation  than  older  models 
such  as  combinational  circuits  and  alternating  Turing  machines  (Ruzzo,  1981)  because  the 
PRAM abstracts the salient features of a modern multiprocessor computer. Eventually an
algorithm  developed  for  the  PRAM  can  be  implemented  on  a  parallel  network  computer  such 
as  a  mesh-connected  array  computer  (Thompson  and  Kung,  1977),  a  hypercube  machine 
(Seitz,  1985),  a  cube-connected  cycles  machine  (Preparata  and  Vuillemin,  1981),  or  a 
bounded  degree  processor  network  (Alt  et  al.,  1987);  on  all  network  computers  the  routing  of 
data  complicates  the  implementation  of  algorithms. 

A  number  of  shared  memory  machines  have  been  built,  such  as  the  Cedar  (Kuck,  1986), 
Cray  X-MP  (Chen,  1983),  NYU  Ultracomputer  (Schwartz,  1980;  Gottlieb  et  al.,  1983),  and 
RP3  (Pfister  et  al.,  1985). 

The PRAM provides the foundation for the design of highly parallel algorithms (Luby,
1986;  Miller  and  Reif,  1985;  among  many  others).  This  model  permits  the  exposure  of  the 
intrinsic  parallelism  in  a  computational  problem  because  it  simplifies  the  communication  of 
data  through  a  shared  memory. 

Because  of  the  widespread  use  of  the  PRAM  model,  further  advances  in  research  on 
parallel  computation  demand  a  thorough  understanding  of  its  capabilities.  In  particular,  we 
study the effect of the instruction set on the performance of the PRAM.

To quantify differences in computational performance, we determine the time
complexities of simulations between PRAMs with different instruction sets. We focus on the
computational complexity of simulations between PRAMs with the following operations:

    multiplication
    division
    arbitrary left shift
    arbitrary right shift
    probabilistic choice

We prove that polynomial time on PRAMs with unit-time multiplication and division or
on PRAMs with unit-time unrestricted shifts is equivalent to polynomial space on Turing
machines (TMs). Consequently, PRAMs with unit-time multiplication and division and
PRAMs with unit-time unrestricted shifts are at most polynomially faster than the standard
PRAMs, which do not have these powerful instructions. These results are surprising for two
reasons. First, for a sequential random access machine (RAM), adding unit-time
multiplication (*) or unit-time unrestricted left shift (↑) seems to increase its power:

    RAM-PTIME = PTIME (Cook and Reckhow, 1973),
    RAM[*]-PTIME = PSPACE (Hartmanis and Simon, 1974),
    RAM[↑]-PTIME = PSPACE (Simon, 1977),

whereas  adding  one  of  these  operations  to  a  PRAM  does  not  increase  its  power  by  more  than 
a  polynomial  in  time.  Second,  despite  the  potential  speed  offered  by  massive  parallelism,  a 
sequential  RAM  with  unit-cost  multiplication  or  unrestricted  shifts  is  just  as  powerful,  within 
a  polynomial  amount  of  time,  as  a  PRAM  with  the  same  additional  operation. 

The  basic  PRAM  has  unit-cost  addition,  subtraction,  Boolean  operations,  comparisons, 
and  indirect  addressing.  Let  PRAM[op]  denote  the  class  of  PRAMs  with  the  basic 
instruction set augmented with the set op of instructions. Let PRAM[op]-TIME(T(n))
denote the class of languages recognized by PRAM[op]s in time O(T(n)) on inputs of length
n, PRAM[op]-PTIME the union of PRAM[op]-TIME(T(n)) over all polynomials T(n),
and PRAM[op]-POLYLOGTIME the union of PRAM[op]-TIME(T(n)) over all T(n) that
are polynomials in log n.

We establish the following new facts about PRAMs. Recall that PSPACE =
PRAM-PTIME (Fortune and Wyllie, 1978).

    PSPACE = PRAM[*]-PTIME                                  (1)
           = PRAM[*,÷]-PTIME
           = PRAM[÷]-PTIME
           = PRAM[↑,↓]-PTIME

    PRAM-POLYLOGTIME = PRAM[*]-POLYLOGTIME                  (2)
                     = PRAM[*,÷]-POLYLOGTIME
                     = PRAM[÷]-POLYLOGTIME
                     = PRAM[↑,↓]-POLYLOGTIME

    NEXPTIME ⊆ PRAM[*,↑]-PTIME ⊆ EXPSPACE                   (3)

    PRAM[*]-PTIME = RAM[*]-PTIME                            (4)

    PRAM[↑,↓]-PTIME = RAM[↑,↓]-PTIME                        (5)

These facts follow from theorems in Chapters 4-8, which give more precise time and space
bounds.

Chandra and Stockmeyer (1976) and Goldschlager (1978) put forward the Parallel
Computation  Thesis:  time  on  a  “reasonable”  parallel  machine  is  polynomially  related  to 
space  on  a  logarithmic-cost  sequential  machine  (for  example,  a  TM).  For  a  thorough 
discussion  of  restrictions  necessary  for  a  “reasonable”  parallel  machine,  see  Parberry 
(1987).  Basically,  a  parallel  machine  is  reasonable  if  the  number  of  processors  is  restricted 
to  an  exponential  and  the  length  of  cell  contents  is  bounded  by  an  exponential. 


The  results  in  (1)  are  the  parallel  analogues  of  the  results  of  Hartmanis  and  Simon 
(1974)  and  Simon  (1977)  for  sequential  RAMs.  Because  of  the  very  long  numbers  that  the 
RAM[*] and RAM[↑,↓] can generate and because of the equivalence of polynomial time on
these models to PSPACE, the RAM[*] and RAM[↑,↓] have sometimes been viewed as
“parallel.” Thus, the PRAM[*] and PRAM[↑,↓] may be viewed as “doubly parallel.” The
results  in  (1)  are  therefore  also  significant  in  that  introducing  unbridled  parallelism  to  a 
random  access  machine  with  unit-time  multiplication  or  unit-time  unrestricted  shift 
decreases  the  running  time  by  at  most  a  polynomial  amount. 
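
As a rough illustration of how quickly register contents grow under unit-time multiplication or unrestricted shift, consider the following Python sketch. It is not taken from the cited papers: the function names are our own, and Python's arbitrary-precision integers merely stand in for the registers of a RAM[*] or a RAM[↑,↓].

```python
def lengths_after_squarings(steps):
    """Bit lengths of a register that starts at 2 and is squared once per step."""
    x = 2
    lengths = []
    for _ in range(steps):
        x = x * x                  # one unit-time multiplication
        lengths.append(x.bit_length())
    return lengths


def length_after_shifts(steps):
    """Bit length of a register after `steps` shifts, each by the register's own contents."""
    x = 1
    for _ in range(steps):
        x = 1 << x                 # one unit-time unrestricted left shift
    return x.bit_length()


print(lengths_after_squarings(5))  # [3, 5, 9, 17, 33]
print(length_after_shifts(5))      # 65537
```

After t squarings the register holds 2^(2^t), a number whose length is exponential in t, and iterated shifts grow faster still. A simulating machine therefore cannot afford to write such registers out in full, which is why the simulations discussed here work with individual bits or with encodings of register contents.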

The  results  in  (2)  are  notable  because  of  their  possible  implications  for  the  robust  class 
NC,  which  can  be  characterized  by  several  different  models  of  parallel  computation  (Cook, 
1985).  If  we  could  reduce  the  number  of  processors  used  by  the  simulation  of  a  PRAM[*], 
PRAM[*,÷], or PRAM[↑,↓] by a PRAM from an exponential number to a polynomial
number, then NC would be the languages accepted by PRAM[*]s, PRAM[*,÷]s, or
PRAM[↑,↓]s, respectively, in polylog time with a polynomial number of processors.

Simon (1981a) showed that PSPACE ⊆ RAM[*,↑]-PTIME ⊆ EXPSPACE. The results
in (3) show that the same upper bound holds for a parallel RAM[*,↑] and give a lower bound
stronger than PSPACE, separating a PRAM[*,↑] from PRAMs with the instruction sets
previously considered, since it is widely believed that NEXPTIME strictly includes PSPACE.
This result does not contradict the Parallel Computation Thesis, however, since the numbers
created by multiplication and shift together are too long and complex to be “reasonable.”

The results in (4) and (5) are implied by (1), but we note these because we strengthened
the time bounds by more direct simulations than through the results in (1).

We have also proved some results for probabilistic PRAMs (prob-PRAMs). Reif
(1984) simulated a prob-RAM[*,÷] M with time bound T(n), memory bound S(n), and
integer bound I(n) by a prob-PRAM[*,÷] in time O(S(n) log I(n) + log T(n)). We simulate
M by a deterministic PRAM[*,÷] in time O(S(n) log I(n) log(S(n)T(n))), then extend the
simulation to a prob-PRAM[*,÷].

In  Chapter  2,  we  review  the  relevant  literature,  and  in  Chapter  3,  we  formally  define  our 
model.  Chapters  4  and  5  contain  the  multiplication  and  division  results.  Chapter  6  contains 
our  theorems  relating  to  PRAMs  with  shifts.  We  establish  bounds  on  the  computational 
power  of  time-bounded  PRAMs  with  both  multiplication  and  shift  in  Chapter  7.  Chapter  8 
contains  our  work  on  PRAMs  with  probabilistic  choice.  Chapter  9  contains  simulations  of 
PRAMs with enhanced instruction sets by sequential RAMs with the same instruction sets.
In Chapter 10, we discuss the effects of variations in the definition of the basic PRAM on our
simulations, and in Chapter 11, we summarize our results and present some open problems
arising  from  our  work. 

A  preliminary  version  of  Chapters  4,  5,  and  6  appeared  at  the  22nd  Annual  Conference 
on  Information  Sciences  and  Systems  in  Princeton,  New  Jersey,  in  March  1988  (Trahan  et 
al., 1988).

Chapter 2. Literature Review

In  this  chapter,  we  survey  research  done  on  instruction  sets  for  RAMs  and  PRAMs.  We 
also  review  results  relating  the  PRAM  to  other  models  of  parallel  computation  and  briefly 
discuss  previous  work  on  probabilistic  models  of  computation.  Unless  otherwise  specified,  a 
RAM  has  unit-cost  addition,  subtraction,  Boolean  operations  on  bit  vectors,  conditional 
jumps,  and  indirect  reads  and  writes. 

•  Instruction  sets 

Hartmanis  (1971)  introduced  the  Random  Access  Stored  Program  (RASP)  machine,  the 
first  computational  model  with  random  access  memory.  Cook  and  Reckhow  (1973) 
presented  a  restricted  RAM  model  whose  instruction  set  did  not  include  Boolean  operations. 
We  shall  call  this  model  an  rRAM.  A  RASP  and  rRAM  can  simulate  each  other  with  at  most 
a  constant  factor  loss  in  time.  Let  l(y)  denote  the  execution  time  of  an  instruction  on  an 
rRAM, where y is the size of the operands. Cook and Reckhow simulated a Turing machine (TM) running
in time T(n) by an rRAM running in time O(T(n)·l(T(n))) and an rRAM running in time
T(n) by a TM running in time O(T³(n)), if l(y) is constant, or O(T²(n)), if l(y) is
logarithmic. They also established a strict time hierarchy for rRAMs. Wiedermann (1983)
improved the bound for the logarithmic time measure to O(T²(n) / log US(n)), where US(n)
is  the  number  of  registers  used  by  the  rRAM. 

Schönhage (1980) proved that for a successor RAM, that is, a RAM without Boolean
operations and whose set of arithmetic instructions is restricted to adding one to a register’s
contents, successor-RAM-PTIME = P.

The  four  papers  upon  which  our  work  is  squarely  based  are  Hartmanis  and  Simon 
(1974),  Pratt  and  Stockmeyer  (1976),  Simon  (1977),  and  Fortune  and  Wyllie  (1978). 
Hartmanis and Simon studied the RAM[*], proving RAM[*]-PTIME = PSPACE.
Thus, the inclusion of multiplication strengthens the RAM[*] over the RAM of Cook and
Reckhow.  To  simulate  a  RAM[*]  on  a  Turing  machine,  Hartmanis  and  Simon  treated  the 
long  numbers  generated  with  multiplication  by  manipulating  only  individual  bits  of  registers. 
In  only  polynomial  space,  a  TM  can  address  any  bit  of  a  register  whose  contents  can  grow 
exponentially  long.  They  also  established  that  the  same  results  hold  if  the  multiplication 
instruction  is  replaced  by  an  instruction  for  the  concatenation  of  two  strings. 

Pratt and Stockmeyer studied a restricted RAM[↑,↓], or vector machine, as they called
it.  A  vector  machine  does  not  have  arithmetic  operations,  and  shift  distances  are  restricted  to 
a  polynomial,  hence  restricting  the  lengths  of  register  contents  to  an  exponential.  They 
proved  that  the  class  of  languages  recognized  in  polynomial  time  on  a  vector  machine  is 
equal to PSPACE, also by manipulating individual bits of registers. A vector machine is
often  viewed  as  a  parallel  computer  in  which  each  processor  handles  a  single  bit:  a  Boolean 
operation  is  seen  as  an  instruction  executed  simultaneously  by  each  processor  and  a  shift  is 
seen  as  a  communication  step  between  processors.  Simon  (1977)  removed  the  restrictions  on 
shift  distances  and  allowed  addition  and  subtraction,  showing  that  the  class  of  languages 
recognized in polynomial time on this machine (a RAM[↑,↓] in our notation) is still equal to
PSPACE. He dealt with the extremely long numbers that a RAM[↑,↓] can generate by
working  with  encodings  of  register  contents.  We  discuss  this  encoding  in  detail  in  Chapter  6. 

Fortune  and  Wyllie  (1978)  introduced  the  PRAM  model,  establishing  that  the  class  of 
languages recognized in polynomial time on this model is also equal to PSPACE. They
showed that in space O(T²(n)), a TM can simulate a PRAM running in time T(n). The TM
executes a procedure that checks that at time t, processor P_j executed instruction number i,
leaving c in its accumulator. This procedure is recursive in time. Fortune and Wyllie also
showed  that  for  nondeterministic  PRAMs,  the  class  of  languages  recognized  in  polynomial 
time  is  equal  to  the  class  of  languages  recognized  in  exponential  time  by  a  nondeterministic 
TM. 

Hartmanis  and  Simon  (1974),  Pratt  and  Stockmeyer  (1976),  and  Fortune  and  Wyllie 
(1978)  all  used  basically  the  same  method  to  simulate  a  space-bounded  TM.  For  all  pairs  of 
TM configurations α and β, the simulating machine Q first generates a transition matrix
indicating whether the simulated TM can make a transition from α to β in one step. By
successive squarings, Q then computes the transitive closure of the matrix, which gives the
T(n)-step transition matrix, and reads from the resulting matrix whether the TM starting in
the initial configuration reaches an accepting configuration in T(n) steps. Fortune and
Wyllie  assigned  one  processor  to  each  configuration;  Pratt  and  Stockmeyer  put  the  entire 
transition  matrix  into  a  single  register,  then  squared  it  with  shifts  and  Boolean  operations; 
and  Hartmanis  and  Simon  did  the  same  as  Pratt  and  Stockmeyer,  using  multiplication  as  a 
restricted  shift  operator.  It  is  interesting  to  note  that  Hartmanis  and  Simon  used  only  the 
shifting  ability  of  multiplication,  but  no  other  property. 
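
The squaring step common to these simulations can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not the construction of any of the cited papers: configurations are numbered 0 to N-1, the one-step relation is supplied as a Boolean matrix `step`, and the names `bool_square` and `reach_within` are hypothetical. Adding the diagonal before squaring makes k squarings yield reachability in at most 2^k steps.

```python
def bool_square(m):
    """Boolean matrix product of m with itself."""
    n = len(m)
    return [[any(m[i][k] and m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reach_within(step, squarings):
    """Matrix r with r[i][j] true iff j is reachable from i in at most 2**squarings steps."""
    n = len(step)
    # Allow "staying put" so that each squaring doubles the step budget.
    r = [[step[i][j] or i == j for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        r = bool_square(r)
    return r


# Toy one-step relation 0 -> 1 -> 2 -> 3; configuration 3 plays the accepting role.
step = [
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, True],
    [False, False, False, False],
]
print(reach_within(step, 1)[0][3])  # False: 3 is not reachable from 0 within 2 steps
print(reach_within(step, 2)[0][3])  # True: 3 is reachable from 0 within 4 steps
```

Because each squaring doubles the number of steps covered, the number of squarings needed is logarithmic in the number of TM steps to be covered. The three simulations differ in how the matrix is stored and squared, not in this overall scheme.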

Recall that a vector machine is a RAM[↑,↓] without addition or subtraction and whose
register  content  lengths  are  bounded  by  an  exponential.  Stockmeyer  (1976)  established  a 
separation  between  time-bounded  vector  machines  and  time-bounded  RAM[*]s  without 
Boolean  operations.  He  described  a  language  that  can  be  accepted  in  constant  time  by  a 
vector  machine,  but  requires  linear  time  on  a  RAM[*]  without  Boolean  operations. 

Let  RP  denote  the  class  of  languages  accepted  in  polynomial  time  on  a  probabilistic 
Turing  machine.  (We  discuss  RP  later  in  this  chapter.)  In  some  early  work  focusing  on 
powerful instruction sets, Schönhage (1979) demonstrated that for a RAM without Boolean
operations, RAM[*]-PTIME ⊆ RP and NP ⊆ RAM[*,÷]-PTIME. Thus, the exclusion of
Boolean operations appears to weaken the RAM[*] model since RP ⊆ PSPACE.

Two  parallel  models  without  shared  memory  were  presented  by  Goldschlager  (1982) 
and  Savitch  and  Stimson  (1979).  Goldschlager  called  his  model  a  conglomerate.  A 
conglomerate  is  an  infinite  collection  of  identical  finite-state  machines  connected  according 
to  some  connection  function.  The  computing  ability  of  a  conglomerate  arises  from  the 
connection  function.  Goldschlager  argued  that  the  set  of  feasibly  buildable  machines  is  the 
set  of  conglomerates  whose  connection  functions  can  be  computed  in  polynomial  space  on  a 
TM. Savitch and Stimson called their model a PRAM; let us call it a tree-PRAM to
distinguish it from our version. The tree-PRAM differs from our PRAM in the
communication between processors. In the PRAM, any processor may communicate with
any other through a shared memory. In the tree-PRAM, there is no shared memory; a
processor may communicate only with its parent (the processor that activated it) and its
children (the processors that it activates). Savitch and Stimson proved the equivalence of
polynomial time on a tree-PRAM and polynomial space on a RAM, where space on a RAM
is defined as the sum of the lengths of the contents of the registers at any time.

Savitch (1982) considered the tree-PRAM with an expanded instruction set, allowing
the processors three string manipulation instructions: concatenation of two strings,
extraction of the first half of a string, and extraction of the second half of a string. He proved
that the class of languages accepted in polynomial time by this model equals PSPACE.
Hartmanis and Simon (1974) proved that the class of languages recognized in polynomial
time on a sequential RAM with concatenation is equal to PSPACE, so the parallelism
allowed in the tree-PRAM with the same instruction set provides no more power, to within a
polynomial in time.

Division  is  a  natural  instruction  for  us  to  consider  as  a  part  of  our  instruction  set. 
Hartmanis and Simon (1974) proved that RAM[*,÷]-PTIME = PSPACE. Simon (1981a)
showed that a RAM with division and left shift can be extremely powerful. He proved
RAM[↑,÷]-PTIME = ER, where ER is the class of languages accepted in time


by  Turing  machines.  At  first  glance,  this  is  a  surprising  result:  for  a  RAM  with  left  shift  and 
right shift, we already know that RAM[↑,↓]-PTIME = PSPACE. Simon proved ER ⊆
RAM[↑,÷]-PTIME by building very long integers with the left shift operation and then
manipulating  them  as  both  integers  and  binary  strings.  Division  is  used  to  generate  a 
complex  set  of  strings  representing  all  possible  TM  configurations.  (Note  that  right  shift 
cannot  replace  division  in  building  these  strings.) 

In  the  same  paper,  Simon  studied  the  inclusion  of  both  multiplication  and  shift  in  the 
instruction  set  and  a  probabilistic  version  of  a  RAM.  Using  the  same  encoding  that  he  had 
previously used to simulate a RAM[↑,↓], he proved RAM[*,↑]-PTIME ⊆ EXPSPACE. A
RAM  with  multiplication  and  shift  can  generate  more  complex  numbers  than  a  RAM  with 
shift  alone;  hence,  the  size  of  the  encoding  of  the  numbers  increases.  This  complexity 
accounts  for  the  increase  from  PSPACE  to  EXPSPACE.  Simon  called  his  probabilistic 
model a stochastic RAM. This RAM has a random number generator that, on operand x,
returns a random integer uniformly distributed in the interval [0, x]. A stochastic RAM
accepts input ω if the probability of reaching an accepting state is greater than ½. Simon
exploited this acceptance condition in his proofs, proving the following results:
stochastic-RAM[↑]-PTIME = ER, NEXPTIME ⊆ stochastic-RAM[*]-PTIME, and
NEXPTIME ⊆ stochastic-PRAM-PTIME. Simon claimed that this model is equivalent to a
prob-PRAM where each processor has an unbiased coin of its own (that is, if the stochastic
RAM has sufficiently powerful arithmetic instructions (* or ↑) to generate long numbers
quickly  enough). 

•  Relationships  between  RAMs  and  TMs 

Hopcroft et al. (1975) described a simulation of a TM running in time T(n) > n log n
by a RAM running in time O(T(n) / log T(n)). The key to the simulation is that, for each
block of TM tape of size O(log T(n)), the RAM precomputes a look-up table containing the
contents of that block after O(log T(n)) steps of the TM.

Katajainen et al. (1988) proved that a T(n) time-bounded, S(n) space-bounded, and
U(n) output-length-bounded TM can be simulated by a RAM in
O(T(n) + (n + U(n)) loglog S(n)) time.

Measuring  space  on  a  RAM  as  the  sum  of  the  lengths  of  the  contents  of  all  registers 
used  by  the  RAM,  Slot  and  van  Emde  Boas  (1988)  established  that  an  off-line  TM  running  in 
space S(n) can simulate an off-line RAM running in space S(n). They also showed that a
simulation  with  no  loss  in  space  is  not  possible  for  the  on-line  versions. 

•  Relationships  between  PRAMs  and  other  computational  models 

Stockmeyer  and  Vishkin  (1984)  established  that  parallel  time  and  number  of  processors 
on  a  concurrent  read,  concurrent  write  (CRCW)  PRAM  correspond  to  depth  and  size  for 
unbounded fan-in circuits. Time and depth correspond to within a constant factor; number of
processors and size correspond to within a polynomial. Because we will use their results
frequently, we state their theorems here.

Theorem 2.1. (Stockmeyer and Vishkin, 1984) Let Z be a PRAM with time bound T(n) and
processor bound P(n). There is an unbounded fan-in circuit C_n that simulates Z in depth
O(T(n)) and size O(P(n)T(n)[T(n)(n+T(n)) + (n+T(n))³ + (n+T(n))(n+P(n)T(n))]).

Proof (sketch). See the proof of Lemma 9.3. 1.1 for a description of the circuit C_n. □

Theorem  2.2.  (Stockmeyer  and  Vishkin,  1984)  Let  C  be  an  unbounded  fan-in  circuit  of  size 
S  and  depth  T  with  n  inputs.  There  is  a  PRAM  Z  that  simulates  C  in  time  O  ( T )  with 
O  (S  +  n )  processors. 

Proof  (sketch).  Each  processor  of  Z  simulates  a  wire  in  C,  and  each  memory  cell  in  Z 
simulates  a  gate  in  C.  Machine  Z  simulates  C  one  level  at  a  time.  To  simulate  an  OR  gate, 
the  corresponding  cell  is  initially  set  to  0.  A  processor  that  is  simulating  a  wire  into  that  gate 
writes  1  into  the  cell  if  its  wire  carries  a  1,  and  it  does  not  write  if  its  wire  carries  a  0.  To 
simulate  an  AND  gate,  the  corresponding  cell  is  initially  set  to  1.  A  processor  that  is 
simulating  a  wire  into  that  gate  writes  0  into  the  cell  if  its  wire  carries  a  0,  and  it  does  not 
write  if  its  wire  carries  a  1.  □ 
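
The proof sketch of Theorem 2.2 can be mimicked sequentially in Python. The sketch below is our own illustration, not the authors' construction: the circuit is assumed to be given level by level, each gate is a pair of a kind ("OR" or "AND") and a list of source cell indices, and the inner loop over wires stands in for processors that would all write in the same PRAM step. Since every processor that writes to a given cell writes the same value, the order of the writes does not matter.

```python
def simulate_circuit(inputs, levels):
    """Evaluate a levelled AND/OR circuit the way the proof sketch describes.

    inputs: list of 0/1 values, stored in cells 0..len(inputs)-1.
    levels: list of levels; each level is a list of (kind, sources) gates,
            where sources are indices of cells written at earlier levels.
    Returns the list of all cell values (inputs followed by gate outputs).
    """
    cells = list(inputs)
    for level in levels:
        base = len(cells)
        # Initialise each gate's cell: 0 for OR, 1 for AND.
        cells.extend(0 if kind == "OR" else 1 for kind, _ in level)
        # One "processor" per wire; conceptually these writes are concurrent.
        for g, (kind, sources) in enumerate(level):
            for src in sources:
                bit = cells[src]
                if kind == "OR" and bit == 1:
                    cells[base + g] = 1      # a 1 on any wire sets the OR cell
                elif kind == "AND" and bit == 0:
                    cells[base + g] = 0      # a 0 on any wire clears the AND cell
    return cells


inputs = [1, 0, 1]
levels = [
    [("OR", [0, 1]), ("AND", [0, 1]), ("AND", [0, 2])],   # cells 3, 4, 5
    [("OR", [4, 5]), ("AND", [3, 4])],                    # cells 6, 7
]
print(simulate_circuit(inputs, levels))  # [1, 0, 1, 1, 0, 1, 1, 0]
```

Each level costs a constant number of PRAM steps in the concurrent-write setting, which is the source of the O(T) time bound; the sketch handles only AND and OR gates, as in the proof sketch above.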

Stockmeyer  and  Vishkin  also  related  time  and  number  of  processors  on  a  CRCW 
PRAM  to  number  of  alternations  and  space  on  an  alternating  TM. 

The  Hardware  Modification  Machine  (HMM)  was  defined  by  Dymond  and  Cook 
(1980).  An  HMM  comprises  a  set  of  finite-state  machines,  called  units.  Each  unit  reads  a 
constant  number  of  input  symbols  from  neighboring  units  and  computes  an  output  symbol 
based  on  its  inputs  and  current  state  at  each  time  step.  Each  unit  has  “taps”  on  the  outputs 
of  other  units  through  which  it  reads  its  inputs.  At  each  step,  a  unit  may  relocate  one  of  its 
taps.  Ruzzo  (1985)  defined  a  restricted  PRAM  (rPRAM)  with  the  following  constraints: 
each  processor  has  a  finite  local  memory;  there  are  no  Boolean  operations;  the  only 
arithmetic instructions are successor and doubling; and shared memory is divided into
blocks, where processor P_i owns the ith block. Ruzzo showed that HMMs and rPRAMs are
equivalent,  to  within  a  constant  factor,  in  both  time  and  hardware  simultaneously. 

Dymond and Tompa (1985) established that for all T(n) > n,
DTIME(T(n)) ⊆ PRAM-TIME(T^(1/2)(n)). Their simulation features the use of a look-up
table like that of Hopcroft et al. (1975).

Hong  (1986)  gave  the  following  informal  definitions  of  space,  parallel  time,  and 
sequential  time  as  they  apply  to  various  models. 

(1)  The  space  (width)  is  the  maximum  length  of  intermediate  information  in 
the  computation. 

(2) The parallel time (reversal) is the total number of phases. A phase is a
period  of  the  computation  during  which  no  information  written  on  the  work 
space  is  read  during  the  same  period. 

(3)  The  sequential  time  is  the  total  number  of  primitive  operations  during  the 
computation. 

He  related  these  complexity  measures  in  the  Similarity  Principle:  “All  idealized 
computational  models  are  similar  in  the  sense  that  their  parallel  time,  their  space,  and  their 
sequential  time  complexities  are  polynomially  related  simultaneously.”  In  support  of  the 
Similarity Principle, Hong established that the following computational models are similar:

    TM (reversal, space)
    RAM[*,÷] (reversal, space)
    Vector Machine (time, space)
    Uniform Circuit (depth, width)
    PRAM[*,÷] (time, space)


Parberry  (1986)  demonstrated  that,  with  enough  processors,  a  CRCW  PRAM  can 
compute  any  recursive  function  in  constant  time. 

Ranade (1987) presented a simulation of a P-processor CRCW PRAM on a P-node
butterfly network such that the network simulates each step of the PRAM in O(log P) time
with  high  probability.  He  used  randomness  only  to  select  a  hash  function  to  distribute  the 
shared  memory  of  the  PRAM  among  the  nodes  of  the  butterfly  network.  Routing  in  the 
network  is  deterministic  and  oblivious. 

Alt et al. (1987) gave a nonuniform deterministic simulation of an exclusive read,
exclusive write (EREW) PRAM on a bounded degree processor network. If the PRAM has
P processors and S cells of shared memory, then the network simulates each step in
O(log P log S) time. If the PRAM is CRCW, then the simulation time increases to
O(log² P log S). For a restricted network, they proved a lower bound of
Ω(log P log S / loglog S) time to simulate an EREW PRAM.

Cook  (1980),  Vishkin  (1983b),  and  Parberry  (1987)  provided  good  surveys  on  parallel 
models  of  computation.  Karp  and  Ramachandran  (1988)  gave  a  good  survey  of  parallel 
algorithms. 

•  Probabilistic  models 

To  close  this  chapter,  we  survey  some  of  the  work  done  on  probabilistic  models  of 
computation. 

de  Leeuw  et  al.  (1956)  showed  that  the  ability  to  make  random  choices  does  not  change 
the  (unbounded)  computational  power  of  Turing  machines  (TMs).  Santos  (1969,  1971) 
investigated  a  more  general  notion  of  probabilistic  Turing  machine  (PTM)  allowing  biased 
random choices; his PTMs have the same power as PTMs allowing only unbiased choices.

Gill (1977) defined a PTM as being allowed to toss an unbiased coin. He defined Blum
complexity measures for time and space. PP is defined as the class of languages recognized
by polynomial time bounded PTMs (no error bounds). BPP is the class of languages
recognized by polynomial time bounded PTMs with bounded error probability. VPP is the
class of languages recognized by polynomial time bounded PTMs with zero error probability
for inputs not in the language. (Note: VPP is more often designated R or RP; I will
henceforth use RP to designate this class.) ZPP is the class of languages recognized by
PTMs with polynomial bounded average time and zero error probability (RP ∩ co-RP).
Gill showed the following:

    P ⊆ ZPP ⊆ RP ⊆ BPP ⊆ PP ⊆ PSPACE,
              RP ⊆ NP ⊆ PP.

He  also  showed  that  only  partial  recursive  functions  are  probabilistically  computable.  He 
described  a  palindrome-like  language  that  can  be  recognized  by  a  fixed  one-tape  PTM  faster 
than  by  any  one-tape  deterministic  TM  (DTM).  He  proved  that  a  PTM  with  time  bound 
T(n) can be simulated by a DTM in time O(T^2(n) 2^{T(n)}).

Simon (1981b) proved RPSPACE = PSPACE (where RPSPACE is polynomial space on
a PTM). He accomplished this by showing RPSPACE = RAM[*]-PTIME; the RAM[*]
treats the computation of the PTM as a Markov chain to compute the probability of
acceptance.

Rabin (1976) gave a general discussion of probabilistic algorithms, considering three
types: (1) algorithms that halt within expected time f(n), always producing the correct
answer and always terminating (sometimes called Las Vegas algorithms), (2) algorithms that
may produce an erroneous answer, and (3) algorithms that may produce an erroneous answer
and may (with probability 0) not halt. (Algorithms of types (2) and (3) are sometimes called
Monte Carlo algorithms.) The randomness he allowed is choosing an integer in {1, ..., n}.

Welsh  (1983)  surveyed  various  randomized  algorithms.  He  also  surveyed  knowledge 
about RP and random log-space (RL: L ⊆ RL ⊆ NL ⊆ P).

Reif (1984) investigated prob-PRAM[*,÷] simulations of the prob-RAM[*,÷]. He used
the CREW version. We present his results in detail in Chapter 8. Reif also showed how
probabilistic choice can be removed from prob-PRAMs with two-sided error by introducing
nonuniformity, with some increase in time and processor bounds.

Robson (1984) demonstrated that a prob-RAM can simulate a deterministic one-tape
TM in expected time O(T(n) / loglog T(n)) under the log-cost criterion. The prob-RAM
handles the TM tape in blocks. Note that Hopcroft et al. (1975) showed a simulation of a
multitape TM in O(T(n) / log T(n)) unit-cost time on a deterministic RAM.

Chapter 3. Definitions and Two Key Lemmas

ram  n.  1.  A  male  sheep.  2.  (Capital  R)  A  constellation  and  sign  of 
the  zodiac,  Aries.  3.  Any  of  several  devices  used  to  drive,  batter,  or 
crush  by  forceful  impact. 

pram  n.  (Chiefly  British)  A  perambulator:  a  baby  carriage. 

(Morris,  1980) 

We study a deterministic PRAM similar to that of Stockmeyer and Vishkin (1984). A
PRAM consists of an infinite collection of processors P_0, P_1, ..., an infinite set of shared
memory cells, c(0), c(1), ..., and a program which is a finite set of instructions labeled
consecutively with 1, 2, 3, ... . All processors execute the same program. Each processor
has a program counter. Each processor P_m has an infinite number of local registers: r_m(0),
r_m(1), ... . Each cell c(j), whose address is j, contains an integer con(j), and each register
r_m(j) contains an integer rcon_m(j).

For convenience we use a PRAM with concurrent read and concurrent write (CRCW)
in which the lowest numbered processor succeeds in a write conflict. At time t in a
computation of a PRAM, at most 2^t processors are active. Since we are concerned with at
least polylog time, there are no significant differences between the concurrent read /
concurrent write (CRCW), the concurrent read / exclusive write (CREW), and the exclusive
read / exclusive write (EREW) PRAMs because the EREW model can simulate the CRCW
model with a penalty of only a logarithmic factor in time (log of the number of processors
attempting to simultaneously read or write) (Cook et al., 1986; Borodin and Hopcroft, 1985;
Fich et al., 1988b; Vishkin, 1983a). If one or more processors attempt to read a cell at the
same time that a processor is attempting to write the same cell, then all reads are performed
before the write.

Initially, the input, a nonnegative integer, is in c(0). For all m, register r_m(0) contains
m. All other cells and registers contain 0, and only P_0 is active. A PRAM accepts its input
if and only if P_0 halts with its program counter on an ACCEPT instruction.

In time O(log n), a processor can compute the smallest n such that con(0) ≤ 2^n − 1; the
PRAM takes this n as the length of the input. Whenever con(i) is interpreted in two's
complement representation, we number the bits of con(i) consecutively with 0, 1, 2, ...,
where bit 0 is the rightmost (least significant) bit.

A PRAM Z has time bound T(n) if for all inputs ω of length n, a computation of Z on ω
halts in T(n) steps. Z has processor bound P(n) if for all inputs ω of length n, Z activates at
most P(n) processors during a computation on ω. We assume that T(n) and log P(n) are
both time-constructible in the simulations of a PRAM[op] by a PRAM, so that all processors
have values of T(n) and P(n).

We allow indirect addressing of registers and shared memory cells through register
contents. The notation c(r_m(j)) refers to the cell of shared memory whose address is
rcon_m(j), and r(r_m(j)) refers to the register of P_m whose address is rcon_m(j).

The basic PRAM model has the following instructions. When executed by processor
P_m, an instruction that refers to register r(i) uses r_m(i).

r(i) ← k  (load a constant)
r(i) ← r(j)  (load the contents of another register)
r(i) ← c(r(j))  (indirect read from shared memory)
c(r(i)) ← r(j)  (indirect write to shared memory)
r(i) ← r(r(j))  (indirect read from local memory)
r(r(i)) ← r(j)  (indirect write to local memory)
ACCEPT  (halt and accept)
REJECT  (halt and reject)
FORK label1, label2  (P_m halts and activates P_{2m} and P_{2m+1}, setting their
    program counters to label1 and label2, respectively)
r(i) ← BIT(r(j))  (read the rcon_m(j)th bit of con(0) (the input))
CJUMP r(j) comp r(k), label  (jump to instruction labeled label on
    condition rcon_m(j) comp rcon_m(k), where the
    arithmetic comparison comp ∈ {<, ≤, =, ≥, >, ≠})
r(i) ← r(j) ○ r(k) for ○ ∈ {+, −, bool}
    (addition, subtraction, bitwise Boolean operations)

Processor P_0 can perform a FORK operation only once. This restriction is necessary to
prevent the activation of multiple processors with identical processor numbers. This is also
the reason why P_m halts when it performs a FORK.

In  the  definition  of  the  FORK  instruction  given  by  Fortune  and  Wyllie  (1978),  the 
processor  executing  a  FORK  remains  active  and  activates  the  lowest  numbered  inactive 
processor. 

For an integer d, define len(d) as the minimum integer w such that
−2^{w−1} ≤ d ≤ 2^{w−1} − 1. Thus, d has a two's complement representation with w bits. Let w =
max{len(rcon_m(j)), len(rcon_m(k))}. To perform a Boolean operation on rcon_m(j) and
rcon_m(k), the PRAM performs the operation bitwise on the w-bit two's complement
representations of rcon_m(j) and rcon_m(k). The PRAM interprets the resulting integer x in
w-bit two's complement representation and writes x in r_m(i). We need at least w bits so that
the result will correctly be positive or negative.
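
The definition of len and the w-bit convention for Boolean operations can be checked with a short Python sketch. The names tc_len and bool_op are illustrative only and do not appear in the thesis; Python's unbounded integers stand in for register contents.

```python
import operator

def tc_len(d: int) -> int:
    """Smallest w with -2**(w-1) <= d <= 2**(w-1) - 1."""
    # A nonnegative d needs bit_length(d) magnitude bits plus a sign bit;
    # a negative d needs the same for ~d, which equals -d - 1.
    return (d if d >= 0 else ~d).bit_length() + 1

def bool_op(op, a: int, b: int) -> int:
    """Apply a bitwise operation to the w-bit two's complement forms of a and b."""
    w = max(tc_len(a), tc_len(b))
    mask = (1 << w) - 1
    raw = op(a & mask, b & mask) & mask   # w-bit pattern of the result
    # Reinterpret the w-bit pattern as a signed integer.
    return raw - (1 << w) if raw >> (w - 1) else raw

print(bool_op(operator.and_, 5, -2), bool_op(operator.xor, 5, -2))
```

For example, 5 and −2 are taken as the 4-bit patterns 0101 and 1110, so their AND is 0100 and their XOR is 1011, which is read back as a negative number because its top bit is set.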

Let ↑ (respectively, ↓) denote the unrestricted left (respectively, right) shift operation:
the instruction r(i) ← r(j) ↑ r(k) (respectively, r(i) ← r(j) ↓ r(k)) places rcon_m(j) · 2^{rcon_m(k)}
(respectively, rcon_m(j) ÷ 2^{rcon_m(k)}) into r_m(i). The instruction can also be viewed as placing
into r_m(i) the result of shifting the binary integer rcon_m(j) to the left (respectively, right) by
rcon_m(k) bit positions.
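
A minimal Python sketch of the two shift instructions follows, restricted to nonnegative operands (the treatment of negative operands under right shift is not spelled out here, so it is left aside). The function names are hypothetical. The loop illustrates how quickly a register grows when it is repeatedly shifted left by its own contents.

```python
def shift_left(a: int, k: int) -> int:
    """r(j) ↑ r(k): multiply a by 2**k."""
    return a << k

def shift_right(a: int, k: int) -> int:
    """r(j) ↓ r(k): integer quotient of a by 2**k (nonnegative a assumed)."""
    return a >> k

# Repeatedly execute r(1) <- r(1) ↑ r(1), starting from the value 1,
# and record the length in bits after each step.
x = 1
lengths = []
for _ in range(4):
    x = shift_left(x, x)
    lengths.append(x.bit_length())
print(lengths)
```

After only four unit-cost steps the register already holds a number more than two thousand bits long, which is why unrestricted shifts call for the short-address technique developed later in this chapter.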

Let prob-PRAM denote the class of probabilistic PRAMs in which each processor is
allowed to uniformly choose one of a constant size multiset of alternatives at each step.
Instructions have the following form:

r(i) ← r(j) ○ r(k); α_1, α_2, ..., α_δ

in which α_1, α_2, ..., α_δ are instruction labels; the processor executes r(i) ← r(j) ○ r(k),
then uniformly selects one of {α_1, α_2, ..., α_δ} as the next instruction.

In some variants of the PRAM model, the input is initially located in the first n cells,
one bit per cell. We therefore have the instruction "r(i) ← BIT(r(j))" in order to remove
the effects of a different input convention. This instruction was also used by Reischuk
(1987).

At  each  step,  each  active  processor  simultaneously  executes  the  instruction  indicated  by 
its  program  counter  in  one  unit  of  time,  then  increments  its  program  counter  by  one,  unless 
the  instruction  causes  a  jump.  On  an  attempt  to  read  a  cell  at  a  negative  address,  the 
processor  reads  the  value  0;  on  an  attempt  to  write  a  cell  at  a  negative  address,  the  processor 
does  nothing. 

The  assumption  of  unit  time  instruction  execution  is  an  essential  part  of  our  definition. 
In  a  sense,  our  work  is  a  study  of  the  effects  of  this  unit-cost  hypothesis  on  the  computational 
power  of  time-bounded  PRAMs  as  the  instruction  set  is  varied. 

For  ease  of  description,  we  sometimes  allow  a  PRAM  a  small  constant  number  of 
separate  memories,  which  can  be  interleaved.  This  allowance  entails  no  loss  of  generality 
and  only  a  constant  factor  time  loss. 

Lemma  3.1.  For  all  T ( n ),  every  language  recognized  in  time  T(n)  by  a  PRAM  R  with  q 
separate  shared  memories,  q  a  constant,  can  be  recognized  in  time  O  ( T  (n))  by  a  PRAM  Z 
with  one  shared  memory. 

Proof. R has q separate shared memories: mem_0, ..., mem_{q−1}. Let c_b(k) denote the kth
cell of mem_b and con_b(k) the contents of that cell.

Z maps c_b(k) of R to c(qk + b). Thus, to simulate an access to c_b(k), Z computes
qk + b in constant time in its local memory, then accesses c(qk + b). In this manner, Z
simulates each step of R in constant time. □
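
The interleaving used in this proof can be sketched in Python as follows. The class and function names are hypothetical, and a dictionary stands in for the single shared memory of Z. Because q is a constant, the product qk can be formed with a constant number of additions, so the address computation needs no multiplication instruction.

```python
def interleaved_address(q: int, b: int, k: int) -> int:
    """Address in the single memory of Z for cell k of mem_b, 0 <= b < q."""
    return q * k + b

class InterleavedMemory:
    """q logical memories stored in one physical memory."""

    def __init__(self, q: int):
        self.q = q
        self.cells = {}  # sparse memory; unwritten cells hold 0

    def read(self, b: int, k: int) -> int:
        return self.cells.get(interleaved_address(self.q, b, k), 0)

    def write(self, b: int, k: int, v: int) -> None:
        self.cells[interleaved_address(self.q, b, k)] = v

mem = InterleavedMemory(q=3)
mem.write(0, 5, 11)   # cell 5 of mem_0
mem.write(2, 5, 22)   # cell 5 of mem_2 lands at a different address
print(mem.read(0, 5), mem.read(2, 5), mem.read(1, 5))
```

Distinct pairs (b, k) always map to distinct addresses, since b is the remainder and k the quotient of the address on division by q.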

In  Chapter  10,  we  discuss  the  effects  of  variations  in  the  definition  of  the  basic  PRAM 
on  our  results.  In  particular,  we  look  at  the  concurrent  read  and  write  capabilities,  write 
conflict  resolution  scheme,  FORK  operation,  and  size  of  local  memory. 

In  the  simulations  to  follow,  the  simulating  PRAM  activates  primary  and  secondary 
processors.  The  Activation  Lemma  tells  how  the  PRAM  activates  them  and  how  their 
processor  numbers  are  related. 

Activation  Lemma.  A  PRAM  R'  can  activate  p  primary  processors,  each  with  s  secondary 
processors,  in  O  (log  p  +  log  s)  steps. 

Proof. In O(log p) steps, R' activates p primary processors. By definition of the FORK
command, these processors are numbered p, p+1, ..., 2p−1. In O(log s) steps, each of
these processors activates s secondary processors. When each primary processor P_g FORKs,
it sets the program counter of P_{2g} to indicate that it is a primary processor and the program
counter of P_{2g+1} to indicate that it is a secondary processor. When each secondary
processor P_h FORKs, it sets the program counters of P_{2h} and P_{2h+1} to indicate that they are
secondary processors. After the O(log s) steps, the primary and secondary processors are
numbered ps, ps+1, ..., 2ps−1. Processors numbered js, p ≤ j ≤ 2p−1, are primary
processors. Processors numbered js+k, 0 ≤ k ≤ s−1, are the secondary processors belonging
to P_{js}. Each primary processor is also considered as a secondary processor belonging to
itself. Primary processor P_{js} corresponds to processor P_{j−p} of the simulated machine R. □
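
The processor numbering claimed in the Activation Lemma can be traced with a small Python sketch. It assumes, for simplicity, that p and s are powers of two and that activation starts from P_1 (after the single FORK of P_0); the function name activate is hypothetical.

```python
def activate(p: int, s: int):
    """Trace FORK rounds; p and s are assumed to be powers of two.

    Returns (primary, groups): primary[j] is the final number of the primary
    processor descended from P_j, and groups[j] lists its secondary processors.
    """
    # Phase 1: every active P_g halts and activates P_2g and P_2g+1.
    active = [1]
    while len(active) < p:
        active = [c for g in active for c in (2 * g, 2 * g + 1)]
    # The active processors are now numbered p, ..., 2p-1.
    groups = {j: [j] for j in active}
    primary = {j: j for j in active}
    # Phase 2: log2(s) further rounds; the child 2g of a primary stays primary.
    for _ in range(s.bit_length() - 1):
        for j in groups:
            groups[j] = [c for h in groups[j] for c in (2 * h, 2 * h + 1)]
            primary[j] = 2 * primary[j]
    return primary, groups

primary, groups = activate(p=4, s=8)
print(primary[5], groups[5])
```

With p = 4 and s = 8, the descendant of P_5 that remains primary is P_40 = P_{5·8}, and its secondary processors are P_40, ..., P_47, matching the numbering js, ..., js+s−1.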

Let R be a PRAM[*]. By repeated application of the multiplication instruction, R can
generate integers of length O(n 2^{T(n)}) in T(n) steps. By indirect addressing, processors in R
can access cells with addresses up to 2^{n 2^{T(n)}} in T(n) steps, though R can access at most
O(P(n)T(n)) different cells during its computation. In subsequent chapters, these cell
addresses will be too long for the PRAM and TM simulators to write. Therefore, we first
construct a PRAM[*] R' that simulates R and uses only short addresses. Similarly, a
PRAM[↑,↓] can generate extremely long integers and use them as indirect addresses, so we
simulate this by a PRAM[↑,↓] that uses only short addresses.
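
The length bound for repeated multiplication can be observed directly. The sketch below, with the hypothetical function name max_length_after, squares the largest n-bit input a given number of times and reports the length of the result; each squaring at most doubles the length, so the result never exceeds n · 2^t bits after t steps.

```python
def max_length_after(n: int, steps: int) -> int:
    """Bit length reached by squaring the largest n-bit integer `steps` times."""
    x = (1 << n) - 1      # largest nonnegative n-bit input
    for _ in range(steps):
        x = x * x         # one unit-cost multiplication
    return x.bit_length()

for t in range(4):
    print(t, max_length_after(8, t), 8 * 2 ** t)
```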

Associative Memory Lemma. Let op ⊆ {*, ÷, ↑, ↓}. For all T(n) and P(n), every language
recognized with P(n) processors in time T(n) by a PRAM[op] R can be recognized in time
O(T(n)) by a PRAM[op] R' that uses O(P^2(n)T(n)) processors and accesses only cells with
addresses in 0, ..., O(P(n)T(n)).

Proof. Let R be an arbitrary PRAM[op] with time bound T(n) and processor bound P(n).
We construct a PRAM[op] R' that simulates R in time O(T(n)) with P^2(n)T(n) processors,
but accesses only cells with addresses in 0, ..., O(P(n)T(n)). R' employs seven separate
shared memories: mem_1, ..., mem_7. R' uses memories mem_1 and mem_2 to simulate the
shared memory of R; memories mem_3, mem_4, and mem_5 to simulate the local registers of R;
and memories mem_6 and mem_7 for communication among the processors. Let c_b(k) denote
the kth cell of mem_b and con_b(k) the contents of that cell. R' organizes the cells of mem_1
and mem_2 in pairs to simulate the memory of R: the first component, c_1(k), holds the
address of a cell in R; the second component, c_2(k), holds the contents of that cell. Actually,
in order to distinguish address 0 from an unused cell, c_1(k) holds one plus the address. Let
pair(k) denote the kth memory pair. R' organizes the cells of mem_3, mem_4, and mem_5 in
triples to simulate the local registers of R: the first component, c_3(k), holds the processor
number; the second component, c_4(k), holds the address of a register in R; the third
component, c_5(k), holds the contents of that register. Let triple(k) denote the kth memory
triple. Since R can access at most O(P(n)T(n)) cells in T(n) steps, R' can simulate the cells
used by R with O(P(n)T(n)) memory pairs and triples.

Let P_m denote processor number m of R; let P'_m denote processor number m of R'.

We now describe the operation of R'. In O(log P(n)) steps, R' activates P(n)
processors, called primary processors. In the next log(P(n)T(n)) steps, each primary
processor activates P(n)T(n) secondary processors, each of which corresponds to a memory
pair and a memory triple.

By the Activation Lemma, primary processor P'_m corresponds to the processor of R
numbered m / P(n)T(n). The processors numbered m+k, for all k, 0 ≤ k ≤ P(n)T(n)−1,
will be the secondary processors belonging to primary processor P'_m. So, if i < m, then all
secondary processors belonging to P'_i are numbered lower than all secondary processors
belonging to P'_m. We exploit this ordering to handle concurrent writes by processors in R.

When the secondary processors of P'_m are not otherwise occupied, they concurrently
read c_6(m) at each time step, waiting for a signal from P'_m to indicate their next tasks. R'
applies its concurrent read capabilities in this way to implement a constant time broadcast
from a primary processor to all of its secondary processors.

Each secondary processor assigns itself to the memory pairs and triples as follows.
Each secondary processor P'_j belonging to P'_m, j = m+k, handles pair(k) and triple(k). We
call k the assignment number of P'_j. P'_j computes its assignment number in constant time.

Suppose R' is simulating step t of R in which P_g writes v in c(f). Then the
corresponding primary processor P'_m of R' writes f+1 in c_1(P(n)(T(n)−t) + g + 1) and v in
c_2(P(n)(T(n)−t) + g + 1). That is, at step t of R, all primary processors of R' write only
cells with addresses in P(n)(T(n)−t)+1, ..., P(n)(T(n)−t+1) with the lowest numbered
primary processor writing in the lowest numbered cell in the block. The memory holds a
copy for every time a processor attempts to write c(f). Figure 3.1 is an example, for P(n) =
4 and T(n) = 3, showing which primary processors write in which cells of mem_1 and mem_2
at each step of R and the assignment numbers of the secondary processors assigned to those
cells. By this ordering, the copy of a cell in R with the current contents (most recently
written by lowest numbered processor) is handled by the lowest numbered secondary
processor. If at some later step a primary processor P'_m desires to read con(f) of R, then its
secondary processors read all copies of con(f) and concurrently write their values in c_7(m).
By the write priority rules, the secondary processor reading the current value of con(f) will
succeed in the write.

Figure 3.1. Memory allocation: T(n) = 3, P(n) = 4

Similarly, suppose R' is simulating a step of R in which P_g writes v in r_g(f). Then P'_m
writes g in c_3(P(n)(T(n)−t) + g + 1), f+1 in c_4(P(n)(T(n)−t) + g + 1), and v in
c_5(P(n)(T(n)−t) + g + 1). If at some later step P'_m desires to read rcon_g(f), then its
secondary processors read all copies of rcon_g(f) and concurrently write their values in
c_7(m).
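
The slot layout for shared-memory writes can be illustrated with a sequential Python sketch. It assumes steps t = 1, ..., T and processor numbers g = 0, ..., P−1 (so that each step owns a block of exactly P pairs), and it replaces the parallel priority write of the secondary processors by a scan from the lowest numbered pair upward. The class name AssociativeMemory is hypothetical.

```python
class AssociativeMemory:
    """Pairs (c1[k], c2[k]) emulating the shared memory of R with short addresses."""

    def __init__(self, P: int, T: int):
        self.P, self.T = P, T
        self.c1 = [0] * (P * T + 1)  # one plus the address in R; 0 marks an unused pair
        self.c2 = [0] * (P * T + 1)  # contents written to that address

    def slot(self, t: int, g: int) -> int:
        # Step t of processor g owns exactly one pair; later steps get lower slots.
        return self.P * (self.T - t) + g + 1

    def write(self, t: int, g: int, f: int, v: int) -> None:
        k = self.slot(t, g)
        self.c1[k], self.c2[k] = f + 1, v

    def read(self, f: int) -> int:
        # Stand-in for the priority concurrent write: the lowest numbered
        # matching pair wins, which is the most recent, lowest-g write.
        for k in range(1, len(self.c1)):
            if self.c1[k] == f + 1:
                return self.c2[k]
        return 0  # a cell never written holds 0

mem = AssociativeMemory(P=4, T=3)
mem.write(t=1, g=2, f=10, v=7)
mem.write(t=1, g=1, f=10, v=5)   # same step, lower processor number wins
first = mem.read(10)
mem.write(t=2, g=3, f=10, v=9)   # a later step overrides earlier ones
second = mem.read(10)
print(first, second, mem.read(11))
```

The same layout, with a processor number stored as an extra first component, gives the triples that hold register contents.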

When a processor P_g of R executes an instruction r(i) ← r(j) ○ r(k), it reads rcon_g(j)
and rcon_g(k), computes v := rcon_g(j) ○ rcon_g(k), and writes v in r_g(i). The corresponding
processor P'_m of R' simulates this step as follows. If j is negative, then P'_m writes 0 in c_6(m)
and c_7(m). Otherwise, P'_m writes g in c_7(m) to indicate that it wishes to read the contents of
a register of P_g and writes 0 in c_6(m) to clear it. Each secondary processor of P'_m reads
c_7(m) and compares con_7(m) with the value of the first component of its assigned memory
triple. P'_m writes j+1 in c_7(m) to specify the address of the register it wishes to read. Each
secondary processor of P'_m that matches g reads c_7(m) and compares con_7(m) with the
second component of its assigned memory triple. If the contents match for secondary
processor P'_l, which is assigned triple(k), then P'_l writes con_5(k) in c_7(m) and 1 in c_6(m) to
inform P'_m that it has found the desired register contents. P'_m reads c_6(m). If con_6(m) = 0,
then no secondary processor matched the address; hence, c(j) is a cell of R that has not yet
been written, and P'_m writes 0 in r_m(1). If con_6(m) = 1, then some secondary processor
matched the address, and P'_m copies con_7(m) to r_m(1). Next, P'_m writes 0 in c_6(m) and
c_7(m) and repeats the process for k, except that P'_m writes r_m(2) instead of r_m(1). (Note:
Handling indirect addresses requires two cycles through the above steps.) P'_m then computes
v := rcon_m(1) ○ rcon_m(2), writing v in r_m(1). Next, if i is negative, then P'_m does nothing.
Otherwise, suppose R' is simulating step t of R. Each primary processor keeps track of t in
its local memory. Then P'_m writes g in c_3(P(n)(T(n)−t) + g + 1), i+1 in
c_4(P(n)(T(n)−t) + g + 1), and v in c_5(P(n)(T(n)−t) + g + 1) to complete the simulation of
step t.

Thus, R' uses a constant number of steps to simulate a step of R and only
O(log P(n)T(n)) initialization time. Since P(n) ≤ 2^{T(n)}, R' uses O(T(n)) steps to simulate
T(n) steps of R. □

Observation 1: R' needs only addition and subtraction to construct any address that it
uses.

Observation  2:  Each  processor  of  R '  uses  only  a  constant  number  of  local  registers. 

Chapter 4. Multiplication

In  this  chapter,  we  simulate  a  time-bounded  PRAM[*]  by  four  different  models  of 
computation:  basic  PRAM,  unbounded  fan-in  circuit,  bounded  fan-in  circuit,  and  Turing 
machine. We establish that polynomial time on a PRAM[*] or a PRAM and polynomial
depth  on  a  bounded  or  unbounded  fan-in  circuit  and  polynomial  space  on  a  TM  are  all 
equivalent. 

4.1.  Simulation  of  PRAM[*]  by  PRAM 

Let R be a PRAM[*] operating in time T(n) on inputs of length n and using at most
P(n) processors. Let R' be a PRAM[*] that uses only short addresses and simulates R
according to the Associative Memory Lemma. Thus, R' uses O(P^2(n)T(n)) processors,
O(T(n)) time, and only addresses in 0, 1, ..., O(P(n)T(n)). Each processor of R' uses only
q registers, where q is a constant.

We construct a PRAM Z that simulates R via R' in O(T^2(n) / log T(n)) time, using
O(P^2(n) T^2(n) n^2 4^{T(n)} log T(n)) processors. We view Z as having q + 4 separate shared
memories: mem_0, ..., mem_{q+3}. Our view facilitates description of the algorithm to follow.
The idea of the proof is that Z stores the cell contents of R' with one bit per cell and acts as
an unbounded fan-in circuit to manipulate the bits.

Initialization. Z partitions mem_q into O(P(n)T(n)) sections of n 2^{T(n)} cells each. Let
S(i) denote the ith section. A section is sufficiently long to hold any number generated in
T(n) steps by R', one bit per cell, in n 2^{T(n)}-bit two's complement representation. Section
S(i) contains con(i) of R' with one bit of con(i) in each of the first len(con(i)) cells of the
section. Z writes the more significant bits in cells with larger addresses.

Z partitions each of mem_0, ..., mem_{q−1} into O(P^2(n)T(n)) blocks of
n 2^{T(n)} · T(n) log T(n) sections each. Let B_i(m) denote the mth block of mem_i. A block is
large enough to implement the multiplication algorithm of Schönhage and Strassen (1971).
The first section of B_i(m) contains rcon_m(i) of R' with one bit of rcon_m(i) in each of the first
len(rcon_m(i)) cells of the section.

Z activates O(P^2(n)T(n)) primary processors, one for each processor of R', in time
O(log P(n)T(n)). Z must quickly access individual cells in each block, so each primary
processor activates O(n^2 4^{T(n)} T(n) log T(n)) secondary processors in O(T(n)) time
(Activation Lemma). For primary processor P_m, secondary processor P_j, j ∈ {m, ...,
m + n 2^{T(n)} − 1}, assigns itself to the (j−m)th cell of the first section of a block. These
processors will handle comparisons.

Secondary processors belonging to P_0 now construct a set of values to be used in the
SQUASH procedure, to be defined later, which handles indirect addressing. The secondary
processors for the first block set con_{q+1}(i) = 2^i, for all i, 0 ≤ i ≤ log P(n)T(n).

In mem_{q+2}, Z builds an address lookup table containing the address of the first cell of
B_i(m) in cell m of the table, 0 ≤ m ≤ P(n)T(n). In all memories, the mth block begins at the
same address. These addresses range up to n^2 4^{T(n)} P(n) T^2(n) log T(n), so Z creates the
table in O(T(n)) time.

Z next spreads the input integer over the first n cells of S(0) of mem_q, that is, Z places
the jth bit of the input word in the jth cell of S(0). This process takes constant time for
processors P_0, ..., P_{n−1}, each performing the BIT instruction indexed by their processor
number. (Note that without the r(k) ← BIT(r(i)) instruction, where rcon_m(i) = j, this process
would take time O(n) because for each j, 1 ≤ j ≤ n, Z would have to construct a mask with a
1 in the jth position and 0's in all other positions to determine the value of the jth input bit.
If T(n) = o(n), then O(n) time is unacceptably high.)

Simulation.  We  are  now  prepared  to  describe  the  simulation  by  Z  of  a  general  step  of 
R'.  Consider  a  processor  Pg  of  R’  and  the  corresponding  primary  processor  Pm  of  Z.  The 
actions  of  Pm  and  its  secondary  processors  depend  on  the  instruction  executed  by  Pg  of  R' . 
Pm  notifies  its  secondary  processors  of  the  instruction.  The  following  cases  arise. 

r(i) ← r(j) + r(k): Chandra et al. (1985) gave an unbounded fan-in circuit of size
O(x (log* x)^2) and constant depth for adding two integers of length x. Stockmeyer and
Vishkin (1984) proved that an unbounded fan-in circuit of depth D(n) and size S(n) can be
simulated by a CRCW PRAM in time O(D(n)) with O(n + S(n)) processors (Theorem 2.2).
By the combination of these two results, the secondary processors perform addition in
constant time with their concurrent write ability. This addition requires
O(n 2^{T(n)} (log*(n 2^{T(n)}))^2) processors.

r(i) ← r(j) ∧ r(k): The secondary processors perform a Boolean AND in one step.
Other Boolean operations are performed analogously.

r(i) ← r(j) − r(k): The secondary processors add rcon_g(j) and the two's complement
of rcon_g(k). This takes constant time.

Comparisons (CJUMP r(i) > r(j), label): For 1 ≤ k ≤ n 2^{T(n)}, the secondary
processor of the first section that normally handles the kth cell of the section handles the
(n 2^{T(n)} − k + 1)th cell. Thus the lowest numbered processor reads the most significant bit. P_m
writes a 0 in c_{q+3}(m). All secondary processors read the (n 2^{T(n)})th cell in B_i(g) and B_j(g) to
determine whether rcon_g(i) and rcon_g(j) are negative. If rcon_g(i) (rcon_g(j)) is negative and
the other is not, then P_m writes 2 (1) in c_{q+3}(m). Otherwise, each secondary processor
allocated to the first section compares corresponding bits of B_i(g) and B_j(g). If both rcon_g(i)
and rcon_g(j) are nonnegative, then if the bits are equal, it does nothing; if the bit of B_i(g) is
greater, it writes a 1 in c_{q+3}(m); if the bit of B_j(g) is greater, it writes a 2 in c_{q+3}(m). If both
rcon_g(i) and rcon_g(j) are negative, then if the bits are equal, it does nothing; if the bit of
B_i(g) is greater, it writes a 2 in c_{q+3}(m); if the bit of B_j(g) is greater, it writes a 1 in c_{q+3}(m).
After this step, if rcon_g(i) = rcon_g(j), then con_{q+3}(m) = 0; if rcon_g(i) ≠ rcon_g(j), then
c_{q+3}(m) holds the value written by the lowest numbered secondary processor to write. If
con_{q+3}(m) = 1, then the comparison is true; otherwise, the comparison is false. This process
works by the concurrent write rules of the PRAM. Other comparisons are performed
analogously and all comparisons can be simulated in constant time.
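
The priority-write idea behind the comparison can be sketched sequentially in Python. The sketch covers only the case of two nonnegative operands; the function name compare_nonneg and the list of proposals are illustrative devices, with the minimum over processor numbers standing in for the rule that the lowest numbered writer succeeds.

```python
def compare_nonneg(a: int, b: int, width: int) -> int:
    """Return 0 if a == b, 1 if a > b, 2 if a < b, for 0 <= a, b < 2**width."""
    flag = 0  # plays the role of the flag cell, cleared before the comparison
    proposals = []
    # Processor 0 has the highest write priority and inspects the most
    # significant bit; processor width-1 inspects bit 0.
    for proc in range(width):
        bit = width - 1 - proc
        x, y = (a >> bit) & 1, (b >> bit) & 1
        if x != y:
            proposals.append((proc, 1 if x > y else 2))
    if proposals:
        # Priority concurrent write: the lowest numbered writer succeeds.
        flag = min(proposals)[1]
    return flag

print(compare_nonneg(6, 5, 4), compare_nonneg(5, 6, 4), compare_nonneg(5, 5, 4))
```

Only the most significant differing bit matters, and the processor reading it is the lowest numbered among all processors that attempt a write, so a single concurrent write step decides the comparison.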

r(i) ← r(j) * r(k): We use the following two lemmas.

Lemma 4.1.1. (Schönhage and Strassen, 1971) A log-space uniform, bounded fan-in circuit
of depth O(log y) and size O(y log y loglog y) can compute the product of two operands of
length y.

Proof. For inputs of length y, Schönhage and Strassen (1971) gave a multiplication
algorithm that may be implemented as a bounded fan-in circuit with depth O(log y) and size
O(y log y loglog y). □

Lemma 4.1.2. A log-space uniform, unbounded fan-in circuit of depth O(log y / loglog y)
and size O(y^2 log y loglog y) can compute the product of two operands of length y.

Proof. Chandra et al. (1984) proved that for any ε > 0, a bounded fan-in circuit of depth
D(y) and size S(y) can be simulated by an unbounded fan-in circuit of depth
O(D(y) / ε loglog y) and size O(2^{(log y)^ε} S(y)). Combining Lemma 4.1.1 with this result and
setting ε = 1, we establish the lemma. □
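
Spelled out, the substitution in this proof can be written as below. This is a sketch that assumes the size bound of Chandra et al. has the form O(2^{(log y)^ε} S(y)), which is the form consistent with the size stated in Lemma 4.1.2; D(y) = O(log y) and S(y) = O(y log y loglog y) are taken from Lemma 4.1.1, and ε = 1.

```latex
\begin{align*}
\text{depth} &= O\!\left(\frac{D(y)}{\epsilon \log\log y}\right)
              = O\!\left(\frac{\log y}{\log\log y}\right), \\
\text{size}  &= O\!\left(2^{(\log y)^{\epsilon}}\, S(y)\right)
              = O\!\left(y \cdot y \log y \log\log y\right)
              = O\!\left(y^{2} \log y \log\log y\right).
\end{align*}
```
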

R' can generate numbers of length up to n 2^{T(n)}. By Theorem 2.2 and Lemma 4.1.2, a
CRCW PRAM can simulate an unbounded fan-in circuit performing multiplication in time
O(T(n) / log T(n)) with O(n^2 4^{T(n)} T(n) log T(n)) processors.

Indirect addressing: By the Associative Memory Lemma, R' accesses only addresses of length $O(\log P(n)T(n))$. If $P_g$ wishes to perform an indirect read from c(r(i)), then $P_m$ and its associated processors perform a SQUASH on $B_i(g)$ in time $O(\log\log P(n)T(n))$. The goal of SQUASH is to place in a single cell the integer whose two's complement representation is stored in a section with one bit per cell. In a SQUASH, the secondary processors associated with the contents of the first $O(\log P(n)T(n))$ cells of a block read their cells. If the $k$th cell contains a 0, then the $k$th processor sets $x_k := 0$. If the $k$th cell contains a 1, then the $k$th processor sets $x_k := 2^k$. (Recall that $2^k$ was previously computed during the initialization period.) Next, the secondary processors compute $\sum_{k=0}^{\log P(n)T(n)} x_k$ in $O(\log\log P(n)T(n)) = O(\log T(n))$ time, since $P(n) \le 2^{T(n)}$. SQUASH thus places a number, which was stored one bit per cell in the first $O(\log P(n)T(n))$ cells of a block, into a single cell.
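The SQUASH step can be sketched sequentially in Python. This is only an illustration under stated assumptions: the name `squash`, the list `cells` (where `cells[k]` holds bit k), and the restriction to nonnegative values are not from the text, and each pass of the `while` loop stands in for one parallel round in which disjoint pairs are added at the same time.

```python
def squash(cells):
    """Pack a number stored one bit per cell into a single integer.

    cells[k] holds bit k (0 or 1). Assumption: the value is nonnegative,
    so no two's complement sign handling is shown.
    Returns (value, rounds), where rounds counts the parallel summation rounds.
    """
    # Powers of two are assumed to be available from an initialization phase.
    powers = [1 << k for k in range(len(cells))]
    # Processor k sets x_k := 2^k if its cell holds 1, and x_k := 0 otherwise.
    x = [powers[k] if bit == 1 else 0 for k, bit in enumerate(cells)]
    rounds = 0
    while len(x) > 1:
        # One parallel round: each disjoint pair is added by one processor.
        x = [x[j] + (x[j + 1] if j + 1 < len(x) else 0)
             for j in range(0, len(x), 2)]
        rounds += 1
    return (x[0] if x else 0), rounds
```

For a block of L cells the loop runs about $\log_2 L$ times, which matches the $O(\log\log P(n)T(n))$ bound when $L = O(\log P(n)T(n))$.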

If processors $P_f$ and $P_g$ of R' wish to simultaneously write c(j), then the corresponding processors $P_l$ and $P_m$ of Z will simultaneously attempt to write S(j) of $\mathit{mem}_q$. If $f < g$, then $l < m$, and all secondary processors of $P_l$ are numbered less than all secondary processors of $P_m$. Thus, in R', $P_f$ succeeds in its write, and in Z, $P_l$ and its secondary processors succeed in their writes.


Theorem 4.1. For all $T(n) \ge \log n$, PRAM[*]-TIME($T(n)$) ⊆ PRAM-TIME($T^2(n) / \log T(n)$).

Proof. According to the above discussion, Z simulates R via R'. Initialization takes $O(\log(P(n)T(n)) + T(n) + \log n) = O(T(n))$ time. Z performs indirect addressing in $O(\log T(n))$ time, multiplication in $O(T(n) / \log T(n))$ time, and all other operations in constant time. Thus, Z uses time $O(T(n) / \log T(n))$ to simulate each step of R'. Z uses $O(P^2(n)T(n))$ primary processors, each with $O(n^2 4^{T(n)} T(n) \log T(n))$ secondary processors. Hence, Z simulates R in $O(T^2(n) / \log T(n))$ time, using $O(P^2(n) T^2(n) n^2 4^{T(n)} \log T(n))$ processors. □

Corollary 4.1.1. PRAM[*]-PTIME = PRAM-PTIME.

Corollary 4.1.2. PRAM[*]-POLYLOGTIME = PRAM-POLYLOGTIME.

If $T(n) = O(\log n)$, then $P(n)$ is a polynomial in n, and Z simulates R in time $O(\log^2 n / \log\log n)$ with polynomially many processors. Thus, an algorithm running in time $O(\log n)$ on a PRAM[*] is in $NC^2$. If $T(n) = O(\log^k n)$, then Z simulates R in time $O(\log^{2k} n / (k \log\log n))$ with $O(n^2\, 2^{O(\log^k n)} \log^k n \log\log n)$ processors. So, our simulation does not show that an algorithm running in time $O(\log^k n)$, $k > 1$, on a PRAM[*] is in NC, because of the superpolynomial processor count. An interesting open problem is to show either that PRAM[*]-POLYLOGTIME = NC by reducing the processor count to a polynomial or that NC is strictly included in PRAM[*]-POLYLOGTIME by proving that the simulation requires a superpolynomial number of processors.

4.2. Simulations of PRAM[*] by Circuits and Turing Machine

We now describe simulations of a PRAM[*] by a log-space uniform family of unbounded fan-in circuits, a log-space uniform family of bounded fan-in circuits, and a Turing machine.

Lemma 4.2.1. For each n, every language recognized by a PRAM[*] R in time $T(n)$ with $P(n)$ processors can be recognized by a log-space uniform, unbounded fan-in circuit $UC_n$ of depth $O(T^2(n) / \log T(n))$ and size $O(n^2 T^2(n) 8^{T(n)} \log T(n))$.

Proof. The depth bound follows from Theorems 4.1 and 2.1. We now establish the size bound. Let R' be the PRAM[*] described in Theorem 4.1 that simulates R according to the Associative Memory Lemma, using $O(T(n))$ time with $O(P(n)T(n))$ processors. Fix an input length n. Let $UC_n$ be a log-space uniform, unbounded fan-in circuit that simulates R' by the construction given by Stockmeyer and Vishkin (1984) (Theorem 2.1), with one modification. For each time step of R', we add to $UC_n$ a block of depth $O(T(n) / \log T(n))$ and size $O(n^2 4^{T(n)} T(n) \log T(n))$ that handles multiplication (Lemma 4.1.2). Thus, $UC_n$ has depth $O(T^2(n) / \log T(n))$ and size $O(P(n)T(n)[T(n)(n+T(n)) + (n+T(n))^3 + (n+T(n))(n+P(n)T(n)) + n^2 4^{T(n)} T(n) \log T(n)]) = O(n^2 T^2(n) 8^{T(n)} \log T(n))$, since $P(n) \le 2^{T(n)}$. □

Lemma 4.2.2. For each n, every language recognized by a PRAM[*] R in time $T(n)$ with $P(n)$ processors can be recognized by a log-space uniform, bounded fan-in circuit $BC_n$ of depth $O(T^2(n))$ and size $O(n^2 T^2(n) 8^{T(n)} \log T(n))$.

Proof. Fix an input length n. Let $UC_n$ be the unbounded fan-in circuit described in Lemma 4.2.1 that simulates R. Except for the circuit blocks implementing multiplication, the portions of the circuit that simulate a single time step of R' have constant depth and fan-in at most $O(P^2(n) T(n))$. Hence, these parts of the circuit can be implemented as a bounded fan-in circuit of depth $O(\log P(n) + \log T(n))$. The multiplication blocks may be implemented as bounded fan-in circuits of depth $O(T(n))$ (Lemma 4.1.1). Let $BC_n$ be this bounded fan-in implementation of $UC_n$. Since $P(n) \le 2^{T(n)}$, $BC_n$ simulates each step of R' in depth $O(T(n))$, and hence $BC_n$ simulates R via R' in depth $O(T^2(n))$ and size $O(n^2 T^2(n) 8^{T(n)} \log T(n))$. □

Theorem 4.2. For all $T(n) \ge \log n$, PRAM[*]-TIME($T(n)$) ⊆ DSPACE($T^2(n)$).

Proof. Theorem 4.2 follows from Lemma 4.2.2 and Borodin's (1977) result that a log-space uniform, bounded fan-in circuit of depth $D(n)$ can be simulated in space $O(D(n))$ on a Turing machine. □

Corollary 4.2.1. PRAM[*]-PTIME = PSPACE.

4.3.  Direct  Simulation  of  PRAM[*]  by  Turing  Machine 

For the sake of completeness, we describe a simulation of PRAM[*] R via R' by a deterministic Turing machine M that uses space $T^2(n)$. This is a direct simulation by the TM, rather than through circuits, as in Section 4.2. M simulates R' by writing only one bit at a time of a cell's contents. Let $\lambda$ denote the empty string.

Let R be a PRAM[*] operating in time $T(n)$ on inputs of length n and using at most $P(n)$ processors. Let R' be a PRAM[*] that accesses only short addresses and simulates R according to the Associative Memory Lemma. Thus, R' uses $O(P^2(n) T(n))$ processors, $O(T(n))$ time, and only addresses in 0, 1, ..., $O(P(n)T(n))$. Each processor of R' uses only q registers, where q is a constant.

We construct a deterministic Turing machine M that simulates R via R' in space $T^2(n)$. By the definition of PRAM, R' accepts input $\omega$ if and only if, by the execution of R' on $\omega$, $P_0$ executes an ACCEPT instruction. To check this condition, M executes the mutually recursive procedures PCOUNTER(m, t), which returns the contents of the program counter of $P_m$ at time t, and VALUE(b, i, m, t), which returns the bth bit of $\mathit{rcon}_m(i)$ at time t, if $m \ne \lambda$, or the bth bit of $\mathit{con}(i)$ at time t, if $m = \lambda$. (VALUE is based on the procedure FIND in Hartmanis and Simon (1974).)

PCOUNTER(m, t) works as follows. Let p be the value returned by PCOUNTER(m, t). PCOUNTER(m, t) depends on r, the value returned by PCOUNTER(m, t-1). If r indicates that $P_m$ was not active at time t-1, then PCOUNTER determines by calls to PCOUNTER for time t-1 whether $P_{\lfloor m/2 \rfloor}$ activated $P_m$ at time t-1 with a FORK instruction. If $P_{\lfloor m/2 \rfloor}$ executes FORK label1, label2, then if m is even, p = label1; otherwise, p = label2. If r indicates that $P_m$ is active at time t-1 and step r is not a CJUMP, REJECT, or ACCEPT instruction, then p = r+1. If step r is CJUMP r(i) comp r(j), label3, where comp is an integer comparison, then PCOUNTER repeatedly calls VALUE for time t-1 to compare $\mathit{rcon}_m(i)$ and $\mathit{rcon}_m(j)$. If the comparison is true, then p = label3; otherwise, p = r+1. If instruction r is an ACCEPT (REJECT), then p = r.
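The case analysis of PCOUNTER can be summarized in a small Python sketch. It covers only the update rule for the program counter, not the recursion over time steps; the function names and the tuple encoding of instructions (for example `("CJUMP", label3)`) are assumptions made for illustration, and the outcome of the comparison is passed in rather than obtained from VALUE.

```python
def fork_entry(m, label1, label2):
    """Counter of a newly activated P_m after P_{m // 2} executes FORK label1, label2."""
    # Even-numbered processors start at label1, odd-numbered ones at label2.
    return label1 if m % 2 == 0 else label2


def next_counter(r, instr, comparison_true=False):
    """Counter p at time t for a processor that was active with counter r at time t-1.

    instr is a hypothetical tuple whose first entry names the instruction at step r;
    for a CJUMP the second entry is the jump label.
    """
    op = instr[0]
    if op in ("ACCEPT", "REJECT"):
        return r                     # the processor stays on its halting instruction
    if op == "CJUMP":
        return instr[1] if comparison_true else r + 1
    return r + 1                     # every other instruction falls through


# Example: P_6 and P_7 are both activated by P_3.
print(fork_entry(6, 10, 20), fork_entry(7, 10, 20))
```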

VALUE(b, i, m, t) works as follows. Let v denote the output of VALUE(b, i, m, t). Suppose $m = \lambda$. By calls to PCOUNTER, M determines whether some processor wrote c(i) at time t-1. If no processor wrote c(i), then v = VALUE(b, i, $\lambda$, t-1). Otherwise, suppose $P_m$ executed instruction c(r(k)) ← r(j) such that $\mathit{rcon}_m(k) = i$ and was the lowest numbered processor that wrote c(i) at time t-1. Then v = VALUE(b, j, m, t-1).

Suppose $m \ne \lambda$. By calls to PCOUNTER, M determines whether $P_m$ wrote $r_m(i)$ at time t-1. If not, then v = VALUE(b, i, m, t-1). Otherwise, suppose $P_m$ executed instruction instr that wrote $r_m(i)$ at time t-1.

If instr is r(i) ← k for a constant k, then v is bit b of k.

If instr is r(i) ← r(j), then v = VALUE(b, j, m, t-1).

If instr is a Boolean operation, such as r(i) ← r(j) ∧ r(k), then v is the result of the Boolean operation on the bth bits of the operands; in this case, v = VALUE(b, j, m, t-1) ∧ VALUE(b, k, m, t-1). M handles the other Boolean operations similarly.

If instr is an arithmetic instruction rather than a Boolean instruction, then the execution of VALUE is more complicated. Suppose instr is a multiplication instruction such as r(i) ← r(j) * r(k). Let $w := \max\{\mathit{len}(\mathit{rcon}_m(j)), \mathit{len}(\mathit{rcon}_m(k))\}$ be the length of the longer operand. Then $\mathit{rcon}_m(i)$ is the sum of at most w partial products, each of length at most 2w. The value of bit b of $\mathit{rcon}_m(i)$ depends on bit b of each partial product and on the carry of length at most $\log w$ from column b-1. Since w can be as large as $2^{T(n)}$, M cannot write all w partial products in polynomial space. So, M computes the sum of the w partial products that contribute to bit b as follows. M computes the carry from column b-1 by computing the sum of each column and the carry from each column from right to left starting at the rightmost column. Each partial product is the product of $\mathit{rcon}_m(j)$ and a bit of $\mathit{rcon}_m(k)$. M uses an accumulator to keep a running total of the sum of the partial products in each column. Suppose M is computing the sum of the sth column. M uses two pointers: one points to the bit of $\mathit{rcon}_m(j)$ involved in the sth column of the gth partial product, the other to the bit of $\mathit{rcon}_m(k)$ involved in the sth column of the gth partial product. Recursive calls to VALUE return the values of these bits. M computes their product (Boolean AND) and adds the product to the accumulator. M then shifts the pointers to point to the bits involved in the sth column of the (g+1)st partial product, finds these bits, computes their product and adds it to the accumulator, and so on until the sth column of all w terms has been summed. The carry from column s to column s+1 equals the sum of the sth column shifted right by one bit position. This process continues until the sum of column b is computed; the value of the bth bit of $\mathit{rcon}_m(i)$ is the lowest-order (rightmost) bit of this sum.

If the multiplier, $\mathit{rcon}_m(k)$, is negative, then M multiplies $\mathit{rcon}_m(j)$ by $|\mathit{rcon}_m(k)|$ and adjusts the sign at the end.
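The column-by-column computation of a single product bit can be sketched in Python. This is an illustration, not the thesis's procedure verbatim: the names `product_bit`, `bit_of`, `bit_j`, and `bit_k` are hypothetical, the two callables stand in for the recursive calls to VALUE, both operands are assumed nonnegative, and Python integers replace the bounded accumulator on M's tape.

```python
def bit_of(x):
    """Return a callable giving bit t of the nonnegative integer x (stand-in for VALUE)."""
    return lambda t: (x >> t) & 1


def product_bit(bit_j, bit_k, w, b):
    """Bit b of the product of two w-bit nonnegative operands.

    Only individual operand bits are read; no partial product is ever
    written out in full. Columns are summed right to left with a carry.
    """
    carry = 0
    for s in range(b + 1):                   # columns 0, 1, ..., b
        acc = carry                          # accumulator for column s
        for g in range(w):                   # g-th partial product
            t = s - g                        # bit of the first operand in column s
            if 0 <= t < w:
                acc += bit_j(t) & bit_k(g)   # Boolean AND of the two bits
        if s == b:
            return acc & 1                   # lowest-order bit of the column sum
        carry = acc >> 1                     # column sum shifted right by one


# Rebuild 13 * 11 one bit at a time.
bits = [product_bit(bit_of(13), bit_of(11), 4, b) for b in range(8)]
print(sum(v << b for b, v in enumerate(bits)))
```

The accumulator never exceeds roughly w plus the incoming carry, so its length stays logarithmic in w, which is the point of the space argument.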

If  instr  is  an  addition  or  a  subtraction  instruction,  then  M’s  actions  are  similar  to,  but 
simpler  than,  its  actions  for  a  multiplication  instruction. 

If instr is an indirect read, such as r(i) ← r(r(j)), then, by calls to VALUE for time t-1, M obtains the bits of $\mathit{rcon}_m(j)$. Then v = VALUE(b, $\mathit{rcon}_m(j)$, m, t-1). If instr is r(i) ← c(r(j)), then v = VALUE(b, $\mathit{rcon}_m(j)$, $\lambda$, t-1). By the Associative Memory Lemma, M can write $\mathit{rcon}_m(j)$ in $O(T(n))$ space.

Theorem 4.3. For all $T(n) \ge \log n$, PRAM[*]-TIME($T(n)$) ⊆ DSPACE($T^2(n)$).

Proof. M simulates R via R' by the simulation described above.

The input is n bits long and the length of the contents of any cell may at most double at each step, so the length of each cell's contents can grow to at most $n 2^{T(n)}$ bits. Hence, M can write b, the bit number, in $O(T(n))$ space. M can write i, the cell address, in $O(\log P(n)T(n))$ space. Since $P(n) \le 2^{T(n)}$, $O(\log P(n)T(n)) = O(T(n))$. R' activates at most $O(P^2(n)T(n))$ processors, so M can write m, the processor number, in $O(\log(P^2(n)T(n))) = O(T(n))$ space. M can write t, the time step, in $O(\log T(n))$ space. Therefore, M can write the parameters of VALUE and PCOUNTER in $O(T(n))$ space.

Consider the space M requires to execute the simulation. M writes all the variables used in one invocation of VALUE or PCOUNTER in space $O(T(n))$, the same space required to write the parameters of VALUE or PCOUNTER. If an instance of VALUE or PCOUNTER with time parameter t makes a recursive call to VALUE or PCOUNTER, then the called instance will have time parameter t-1. Recall that R' simulates R in time $O(T(n))$, so the depth of recursion is $O(T(n))$. Hence, with linear space compression, M simulates R via R' in space $T^2(n)$. □


Chapter  5.  Division 


In this chapter, we study the division instruction. Let us assume that the division instruction returns the quotient. We are interested in the division instruction for two reasons. First, division is a natural arithmetic operation. Second, Simon (1981a) has shown that a RAM with division and left shift (↑) can be very powerful. He proved RAM[↑,÷]-PTIME = ER, where ER is the class of languages accepted in time


by Turing machines. At first glance, this is a surprising result: for a RAM with left shift and right shift (↓), we already know that RAM[↑,↓]-PTIME = PSPACE (Simon, 1977). Simon proved ER ⊆ RAM[↑,÷]-PTIME by building very long integers with the left shift operation and then manipulating them as both integers and binary strings. The division instruction is used to generate a complex set of strings representing all possible TM configurations. (Note that right shift cannot replace division in building these strings.)

We consider the power of division paired with multiplication rather than with left shift, as well as the power of only division with our basic instruction set. From Hartmanis and Simon (1974), we also know that RAM[*,÷]-PTIME = PSPACE.

In the following, let MD be a PRAM[*,÷] that uses $T(n)$ time and $P(n)$ processors. Let MD' be a PRAM[*,÷] that uses only short addresses and simulates MD according to the Associative Memory Lemma. Thus, MD' uses $O(P^2(n)T(n))$ processors, $O(T(n))$ time, and only addresses in 0, 1, ..., $O(P(n)T(n))$.

We begin by describing the simulation of a PRAM[*,÷] by a PRAM. The idea of the proof is that we modify the simulation of a PRAM[*] by a PRAM (Section 4.1). Because this simulation depends on the relationship between circuits and PRAMs (Theorem 2.2), we are interested in the Boolean circuit complexity of division. Beame et al. (1986) developed a circuit for dividing two n-bit numbers in depth $O(\log n)$. This circuit, however, is polynomial-time uniform, and we need the stronger condition of log-space uniformity. Reif (1986) devised a log-space uniform, depth $O(\log n \log\log n)$ division circuit, and Shankar and Ramachandran (1987) improved the size bound of this circuit. We will need the following lemma.

Lemma 5.1.1. A PRAM can compute the quotient of two x-bit operands in time $O(\log x)$ with $O((1/\delta^4)\, x^{1+\delta})$ processors, for any $\delta > 0$.

Proof. Given in Shankar and Ramachandran (1987). □

Simulation. We construct a PRAM Z that simulates MD via MD' in time $O(T^2(n))$.

We modify the simulation of a PRAM[*] by a PRAM (Section 4.1) to deal with division instructions. By Lemma 5.1.1 with $x = n 2^{T(n)}$, Z can perform a division in time $O(T(n))$ with the available processors and $0 < \delta < 1$.

Theorem 5.1. For all $T(n) \ge \log n$, PRAM[*,÷]-TIME($T(n)$) ⊆ PRAM-TIME($T^2(n)$).

Proof. By the simulation above, Z simulates each step of MD' in time $O(T(n))$ with $O(P^2(n) T^2(n) n^2 4^{T(n)} \log T(n))$ processors. MD' runs for $O(T(n))$ steps, so Z can simulate MD via MD' in $O(T^2(n))$ steps. □

Corollary 5.1.1. PRAM[*,÷]-PTIME = PRAM-PTIME.

Corollary 5.1.2. PRAM[*,÷]-POLYLOGTIME = PRAM-POLYLOGTIME.

Next, we consider the simulation of a PRAM[*,÷] by a Turing machine. We construct a TM M that simulates MD via MD' in $T^2(n) \log T(n)$ space by modifying the simulation of a PRAM[*] by a TM (Section 4.2). We will need the following lemmas.

Lemma 5.2.1. A log-space uniform, bounded fan-in circuit can compute the quotient of two x-bit operands in depth $O(\log x \log\log x)$ and size $O((1/\delta^4)\, x^{1+\delta})$, for any $\delta > 0$.

Proof. Given in Shankar and Ramachandran (1987). □

Lemma 5.2.2. For each n, every language recognized by a PRAM[*,÷] MD in time $T(n)$ with $P(n)$ processors can be recognized by a log-space uniform, bounded fan-in circuit $DC_n$ of depth $O(T^2(n) \log T(n))$.

Proof. Fix an input length n. Let $BC_n$ be the bounded fan-in circuit described in Lemma 4.2.2 that simulates a PRAM[*]. Let $DC_n$ be $BC_n$ with additional circuit blocks for division. To handle division instructions with operands of length at most $x = n 2^{T(n)}$, we use the log-space uniform, $O(\log x \log\log x)$ depth, bounded fan-in division circuit specified in Lemma 5.2.1. Circuit $DC_n$ is at most a constant factor larger in size than $BC_n$. Hence, $DC_n$ uses depth $O(T(n) \log T(n))$ to simulate each step of MD. □

Theorem 5.2. For all $T(n) \ge \log n$, PRAM[*,÷]-TIME($T(n)$) ⊆ DSPACE($T^2(n) \log T(n)$).

Proof. Theorem 5.2 follows from Lemma 5.2.2 and Borodin's (1977) result that a bounded fan-in circuit of depth $D(n)$ can be simulated in space $O(D(n))$ on a Turing machine. □

Through Theorem 5.2 and the simulation of DSPACE($T(n)$) in PRAM-TIME($T(n)$) (Fortune and Wyllie, 1978), we can obtain an $O(T^2(n) \log T(n))$ time simulation of a PRAM[*,÷] by a PRAM. We remove the $\log T(n)$ factor by the direct simulation in Theorem 5.1.

Through Theorem 5.1 and the simulation of PRAM-TIME($T(n)$) in DSPACE($T^2(n)$) (Fortune and Wyllie, 1978), we obtain an $O(T^4(n))$ space simulation of a PRAM[*,÷] by a TM. The simulation of Theorem 5.2 is more efficient.

Corollary 5.2.1. PRAM[*,÷]-PTIME = PSPACE.

We now present the simulation of a PRAM[÷] by a PRAM. Let D be a PRAM[÷] that uses $T(n)$ time and $P(n)$ processors. We construct a PRAM Z that simulates D in time $O(T(n) \log(n+T(n)))$. Z acts as a circuit to simulate the computation of D.

Simulation. We modify the simulation of a PRAM[*] by a PRAM from Section 4.1. In $T(n)$ steps, a PRAM[*] can build integers of length $n 2^{T(n)}$, whereas a PRAM[÷] can build only integers of length $O(n+T(n))$. As a result, Z partitions the memory into blocks containing only $O(n+T(n))$ cells each. Z activates $P(n)$ primary processors, each with $O((1/\delta^4)(n+T(n))^{1+\delta})$ secondary processors. The simulation proceeds along the same lines as in Section 4.1 until a division instruction arises. By Lemma 5.1.1, Z can perform a division in time $O(\log(n+T(n)))$.
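The difference in operand growth that drives this simulation can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that repeated squaring is the fastest way to lengthen a value with multiplication and that repeated doubling (x + x) is the fastest way without it.

```python
def length_after_squaring(n, steps):
    """Bit length after `steps` squarings of the largest n-bit value."""
    x = (1 << n) - 1
    for _ in range(steps):
        x = x * x                # the length can at most double per step
    return x.bit_length()


def length_after_doubling(n, steps):
    """Bit length after `steps` additions x + x starting from the largest n-bit value."""
    x = (1 << n) - 1
    for _ in range(steps):
        x = x + x                # the length grows by exactly one bit per step
    return x.bit_length()


n, steps = 8, 3
print(length_after_squaring(n, steps), "<=", n * 2 ** steps)
print(length_after_doubling(n, steps), "==", n + steps)
```

With multiplication the length is bounded by $n 2^{T(n)}$, while without it the length stays within $n + T(n)$, which is why the blocks of Z need only $O(n+T(n))$ cells here.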

Theorem 5.3. For all $T(n) \ge \log n$, PRAM[÷]-TIME($T(n)$) ⊆ PRAM-TIME($T(n) \log(n+T(n))$).

Proof. By the simulation above, Z simulates D in time $O(T(n) \log(n+T(n)))$ with $O((P(n)/\delta^4)(n+T(n))^{1+\delta})$ processors. □

Corollary 5.3.1. PRAM[÷]-PTIME = PRAM-PTIME.

Corollary 5.3.2. PRAM[÷]-POLYLOGTIME = PRAM-POLYLOGTIME.

Observe that a RAM[÷] is unable to quickly generate long integers. Therefore, the gap between the time-bounded power of a RAM[÷] and the time-bounded power of a PRAM[÷] is much greater than the gap between the power of a RAM[*] and the power of a PRAM[*].

Let $PC = \{PC_1, PC_2, \ldots\}$ be the family of bounded fan-in circuits that simulates the family C of unbounded fan-in circuits described by Stockmeyer and Vishkin (1984). For a fixed input size n, the depth of $PC_n$ is $O(T(n) \log P(n)T(n))$ and the size is $O(P(n)T(n)[T(n)(n+T(n)) + (n+T(n))^3 + (n+T(n))(n+P(n)T(n))])$.

Theorem 5.4. For each n, every language recognized by a PRAM[÷] D in time $T(n)$ with $P(n)$ processors can be recognized by a log-space uniform, bounded fan-in circuit $DB_n$ of depth $O(T(n) \log P(n) + T(n) \log(n+T(n)) \log\log(n+T(n)))$.

Proof. Fix an input length n. Let $PC_n$ be the bounded fan-in circuit described above that simulates a PRAM. Let $DB_n$ be $PC_n$ with additional circuit blocks for division. To handle division instructions with operands of length at most $x = n + T(n)$, we use the log-space uniform, $O(\log x \log\log x)$ depth, bounded fan-in division circuit specified in Lemma 5.2.1. Circuit $DB_n$ is at most a constant factor larger in size than $PC_n$. Hence, $DB_n$ uses depth $O(\log P(n) + \log(n+T(n)) \log\log(n+T(n)))$ to simulate each step of D. □

Lemma 5.5.1. An off-line Turing machine can compute the quotient of two n-bit operands in $O(\log n \log\log n)$ space.

Proof. Borodin (1977) proved that an off-line TM can simulate a log-space uniform circuit with bounded fan-in and depth $D(n)$ in space $O(D(n))$. Combined with the log-space uniform $O(\log n \log\log n)$ depth division circuit of Shankar and Ramachandran (1987), we have the lemma. □

Theorem 5.5. For all $T(n) \ge \log n$, PRAM[÷]-TIME($T(n)$) ⊆ DSPACE($T^2(n)$).

Proof. Fortune and Wyllie (1978) simulated each PRAM running in time $T(n)$ by a TM running in space $O(T^2(n))$. They used recursive procedures of depth $O(T(n))$ using space $O(T(n))$ at each level of recursion. If we augment the simulated PRAM with division, then by Lemma 5.5.1, an additional $O(\log(n+T(n)) \log\log(n+T(n)))$ space is needed at each level, so $O(T(n))$ space at each level still suffices. Hence, with linear space compression, a TM with space $T^2(n)$ can simulate a PRAM[÷] running in time $T(n)$. □

Corollary 5.5.1. PRAM[÷]-PTIME = PSPACE.


Chapter 6. Shift

Pratt and Stockmeyer (1976) proved that for vector machines, that is, a RAM[↑,↓] without addition or subtraction in which left shift (↑) and right shift (↓) distances are restricted to a polynomial number of bit positions, RAM[↑,↓]-PTIME = PSPACE. Simon (1977) proved the same equality for RAMs with unrestricted left shift and right shift, addition, and subtraction. We prove that polynomial time on PRAMs with unrestricted shifts is equivalent to polynomial time on basic PRAMs and to polynomial space on Turing machines (TMs).

6.1. Simulation of PRAM[↑,↓] by PRAM

By repeated application of the left shift instruction, a PRAM[↑,↓] can generate numbers of length

$$O\Bigl(\left. 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}} \right\}\, T(n)\Bigr)$$

in $T(n)$ steps. These extremely large numbers will contain very long strings of 0's, however.
(If Boolean operations are used, then the numbers will have very long strings of 0's and very long strings of 1's.) Since we cannot write such numbers in polynomial space, nor can we address an individual bit of such a number in polynomial space, we encode the numbers and manipulate the encodings. We use the marked interesting bit (MIB) encoding, an enhancement of the interesting bit encoding of Simon (1977). Let d be an integer. Define len(d) as the minimum integer w such that $-2^{w-1} \le d \le 2^{w-1} - 1$. Let $b_{w-1} \cdots b_0$ be the w-bit two's complement representation of d. An interesting bit of d is a bit $b_i$ such that $b_i \ne b_{i+1}$. ($b_w$ is not an interesting bit.)

If d has an interesting bit at $b_i$ and the next interesting bit is at $b_j$, $i < j$, then the bits $b_j b_{j-1} \cdots b_{i+1}$ are identical. If these bits are 0's (1's), then we say that d has a constant interval of 0's (1's) at $b_j$.

If a constant interval has length 1, then the entire interval is a single bit, which is an interesting bit. We call such an interesting bit a singleton. We mark interesting bits that are singletons. We define the MIB encoding as

E(0) = 0s,

E(01) = 1s,

$E(d) = (E(a_t)q_t, \ldots, E(a_2)q_2, E(a_1)q_1; r)$,

where d is an integer; $a_j$ is the position of the jth interesting bit of d; $q_j = s$ if the jth interesting bit is a singleton and $q_j = m$ if the jth interesting bit is not a singleton; and r is the value (0 or 1) of the rightmost bit of d. Call $q_j$ the mark of the jth interesting bit. Call r the start bit.

For example, E(01100) = (E(011)m, E(01)m; 0) = ((E(01)m; 1)m, 1sm; 0) = ((1sm; 1)m, 1sm; 0). For all d, define val(E(d)) = d. The marks of interesting bits will be useful for quickly computing E(d+1) from E(d).
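The MIB encoding can be reproduced with a short Python sketch. The function names `interesting_bits` and `encode` and the string layout of the output are choices made for this illustration, and the sketch relies on Python's arithmetic right shift to read two's complement bits.

```python
def interesting_bits(d):
    """Positions i, in increasing order, at which bit i differs from bit i+1."""
    positions = []
    i = 0
    # (d >> i) is 0 or -1 once only copies of the sign bit remain.
    while (d >> i) not in (0, -1):
        if ((d >> i) & 1) != ((d >> (i + 1)) & 1):
            positions.append(i)
        i += 1
    return positions


def encode(d):
    """Return the MIB encoding of d as a string such as '((1sm; 1)m, 1sm; 0)'."""
    if d == 0:
        return "0s"
    if d == 1:
        return "1s"
    parts = []
    previous = -1                       # position just below the rightmost bit
    for a in interesting_bits(d):
        # A constant interval of length 1 makes the interesting bit a singleton.
        mark = "s" if a - previous == 1 else "m"
        parts.append(encode(a) + mark)
        previous = a
    parts.reverse()                     # leftmost interesting bit is written first
    return "(" + ", ".join(parts) + "; " + str(d & 1) + ")"


print(encode(0b01100))
```

Running the sketch on 01100 gives the same string as the worked example in the text.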

An encoding can be viewed as a tree. For the encoding $E(d) = (E(a_t)q_t, \ldots, E(a_2)q_2, E(a_1)q_1; r)$, a node is associated with each of $E(a_t)q_t, \ldots, E(a_1)q_1$, r. A root node is associated with the entire encoding E(d) and holds nothing. A nonroot node holds one of 0s; 1s; r, the start bit; or $q_j$, the mark of the jth interesting bit. If a node holds 0s, 1s, or r, then it is a leaf. If a node holds $q_j$, then it is an internal node, and its children are the nodes of $E(a_j)$. Figure 6.1 contains a sketch of the encoding tree of E(01100).


Figure  6.1.  Encoding  tree  for  E(01100) 


For a node $\alpha$ corresponding to $E(a_k)q_k$, the value of the subtree rooted at $\alpha$, val($\alpha$), is $a_k$, the value of $E(a_k)$. Thus, val($\alpha$) is the position of an interesting bit.

We define level 0 of a tree as the root. We define level j of a tree as the set of all children of nodes in level j-1 of the tree.

A pointer into an encoding specifies a path starting at the root of the tree. For instance, the pointer 7.5.9 specifies a path $x_0 x_1 x_2 x_3$ in which $x_0$ is the root, $x_1$ is the 7th child (from the right) of $x_0$, $x_2$ is the 5th child of $x_1$, and $x_3$ is the 9th child of $x_2$. A pointer also specifies the subtree rooted at the last node of the path.
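Following a pointer can be sketched in Python on a hand-built tree for E(01100). The dictionary representation, the names `node`, `TREE`, and `follow`, and the decision to count the start bit as the rightmost child are all assumptions made for this illustration; the text does not fix these details.

```python
def node(label, children=()):
    """A tree node; children are stored left to right."""
    return {"label": label, "children": list(children)}


# Tree for E(01100) = ((1sm; 1)m, 1sm; 0); the root holds nothing.
TREE = node(None, [
    node("m", [node("m", [node("1s")]), node("1")]),   # E(011)m
    node("m", [node("1s")]),                           # E(01)m
    node("0"),                                         # start bit (assumed rightmost child)
])


def follow(tree, pointer):
    """Return the node reached by a dotted pointer such as '3.2.1'."""
    current = tree
    for k in map(int, pointer.split(".")):
        kids = current["children"]
        current = kids[len(kids) - k]   # k-th child counted from the right, 1-based
    return current


print(follow(TREE, "3.2.1")["label"])
```

Because a pointer names one child index per level, its length is governed by the depth of the encoding and by the number of children at each level, which is what the following lemmas bound.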

For an integer d, suppose $E(d) = (E(a_t)q_t, \ldots, E(a_1)q_1; r)$. We define intbits(d) = t, the number of interesting bits in d. Viewing E(d) as a tree, we refer to $E(a_j)$ as a subtree of E(d). We define the kth subtree at level c of E(d) as the kth subtree from the right whose root is distance c from the root of E(d). We define depth(d) recursively by

depth(0) = depth(1) = 1,

$\mathit{depth}(d) = 1 + \max\{\mathit{depth}(a_t), \ldots, \mathit{depth}(a_1)\}$.

We now present three lemmas, analogous to those of Simon (1977), that bound the length of a pointer into an encoding. Lemma 6.1.1 bounds the depth of an encoding and the number of interesting bits in a number generated by a PRAM[↑,↓]. Let bool be a set of Boolean operations.

Lemma 6.1.1. Suppose a processor $P_m$ executes r(i) ← r(j) ○ r(k), ○ ∈ {+, ↑, ↓, -, bool}.

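i) If ○ is +, then $\mathit{depth}(\mathit{rcon}_m(i)) \le 1 + \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$ and $\mathit{intbits}(\mathit{rcon}_m(i)) \le \mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$.

ii) If ○ is a Boolean operation, then $\mathit{depth}(\mathit{rcon}_m(i)) \le \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$ and $\mathit{intbits}(\mathit{rcon}_m(i)) \le \mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$.

iii) If ○ is -, then $\mathit{depth}(\mathit{rcon}_m(i)) \le 1 + \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$ and $\mathit{intbits}(\mathit{rcon}_m(i)) \le \mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$.

iv) If ○ is ↑ or ↓, then $\mathit{depth}(\mathit{rcon}_m(i)) \le 2 + \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$ and $\mathit{intbits}(\mathit{rcon}_m(i)) \le 1 + \mathit{intbits}(\mathit{rcon}_m(j))$.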

Proof. i) If $P_m$ executes r(i) ← r(j) + r(k), then an interesting bit of $\mathit{rcon}_m(i)$ is either in the same position or is one position to the left of an interesting bit of $\mathit{rcon}_m(j)$ or $\mathit{rcon}_m(k)$. As a result, depth may increase by at most 1. For example, using two's complement representations, if $\mathit{rcon}_m(j) = 011$ and $\mathit{rcon}_m(k) = 01$, then $\mathit{rcon}_m(i) = 0100$. The depth of both addends is 1, while the depth of their sum is 2. Thus, $\mathit{depth}(\mathit{rcon}_m(i))$ will be at most $1 + \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$, and $\mathit{intbits}(\mathit{rcon}_m(i))$ will be at most $\mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$.

ii) If $P_m$ executes r(i) ← r(j) ∨ r(k), then an interesting bit of $\mathit{rcon}_m(i)$ is an interesting bit of either $\mathit{rcon}_m(j)$ or $\mathit{rcon}_m(k)$. Thus, $\mathit{depth}(\mathit{rcon}_m(i))$ will be at most $\max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$, and $\mathit{intbits}(\mathit{rcon}_m(i))$ will be at most $\mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$. Other Boolean operations produce the same results.

iii) If $P_m$ executes r(i) ← r(j) - r(k), and we view r(j) - r(k) as r(j) + (1 + ¬r(k)), then by Parts i) and ii), $\mathit{depth}(\mathit{rcon}_m(i))$ will be at most $1 + \max\{\mathit{depth}(\mathit{rcon}_m(j)), \mathit{depth}(\mathit{rcon}_m(k))\}$, and $\mathit{intbits}(\mathit{rcon}_m(i))$ will be at most $\mathit{intbits}(\mathit{rcon}_m(j)) + \mathit{intbits}(\mathit{rcon}_m(k))$.

iv) If $P_m$ executes r(i) ← r(j) ↑ r(k) and $E(\mathit{rcon}_m(j)) = (E(a_q), \ldots, E(a_1); w)$, $w \in \{0, 1\}$, then $E(\mathit{rcon}_m(i)) = (E(a_q + \mathit{rcon}_m(k)), \ldots, E(a_1 + \mathit{rcon}_m(k)), [E(\mathit{rcon}_m(k))]; x)$, where we include $E(\mathit{rcon}_m(k))$ if and only if $\mathit{rcon}_m(k) \ne 0$ and $w = 1$; $x = w$ if $\mathit{rcon}_m(k) = 0$, and $x = 0$ if $\mathit{rcon}_m(k) > 0$. The claim holds. By a similar argument, the claim holds for right shift. □

Part i) of Lemma 6.1.2 bounds the number of subtrees below first level nodes in an
encoding; Part ii) bounds the number of subtrees below l-th level nodes in an encoding,
l > 1.

Lemma 6.1.2. Suppose a processor P_m executes r(i) ← r(j) ○ r(k), where ○ ∈ {+, ↑, ↓,
-, bool}, E(rcon_m(i)) = (E(a_r), ..., E(a_1); w_i), E(rcon_m(j)) = (E(b_s), ..., E(b_1); w_j), and
E(rcon_m(k)) = (E(c_t), ..., E(c_1); w_k), where a_v, b_v, and c_v denote the positions of the v-th
interesting bits of rcon_m(i), rcon_m(j), and rcon_m(k), respectively.

i) For E(a_v) (that is, the v-th subtree at level 1 of E(rcon_m(i))),

a) if ○ is +, then intbits(a_v) ≤ 1 + max_q {intbits(b_q), intbits(c_q)},

b) if ○ is ↑ or ↓, then intbits(a_v) ≤ max_q {intbits(b_q)} + intbits(rcon_m(k)),

c) if ○ is a Boolean operation, then intbits(a_v) ≤ max_q {intbits(b_q), intbits(c_q)},
and

d) if ○ is -, then intbits(a_v) ≤ 1 + max_q {intbits(b_q), intbits(c_q)}.

ii) For E(β) a subtree at level l > 1, intbits(β) ≤ 1 + max_q {intbits(q-th subtree of
rcon_m(j) at level l), intbits(q-th subtree of rcon_m(k) at level l)}.

Proof. i) A first level subtree E(a_v) encodes the position p of an interesting bit in rcon_m(i).
The subtrees of E(a_v) encode the positions of interesting bits in p.

a) If P_m executes r(i) ← r(j) + r(k), then, as can be seen in ADD, p is either the
position of an interesting bit in rcon_m(j) or rcon_m(k) or p is one plus the position of an
interesting bit in rcon_m(j) or rcon_m(k). In the first case, intbits(a_v) ≤
max_q {intbits(b_q), intbits(c_q)}. In the second case, by Lemma 6.1.1 i), intbits(a_v) ≤ 1 +
max_q {intbits(b_q), intbits(c_q)}.

b) If P_m executes r(i) ← r(j) ↑ r(k), then p is the sum of rcon_m(k) and the position
of an interesting bit of rcon_m(j). By Lemma 6.1.1, intbits(a_v) ≤ max_q {intbits(b_q)} +
intbits(rcon_m(k)). By a similar argument, the claim holds for right shift.

c) If P_m executes r(i) ← r(j) ∨ r(k), then p is the position of an interesting bit in
either rcon_m(j) or rcon_m(k). Thus, intbits(a_v) ≤ max_q {intbits(b_q), intbits(c_q)}. Other
Boolean operations produce the same results.

d) If P_m executes r(i) ← r(j) - r(k), then the claim holds by Parts i)a) and i)c).

ii) For any instruction, we add at most 1 to the value of a subtree of level l > 1; hence,
Part i)a) applies. □

Lemma 6.1.3. A pointer can be specified in O(T^2(n)) space on a TM.

Proof. Let d be an integer generated by a PRAM[↑,↓]. By Lemma 6.1.1,
depth(d) ≤ 2T(n). If ω is the input to the PRAM[↑,↓] and ω has length n, then
intbits(ω) ≤ n. Let E(β) be either E(d) or a subtree of E(d). By Lemmas 6.1.1 and 6.1.2,
intbits(β) ≤ n 2^{T(n)}. Therefore, any leaf in E(d) can be specified by a pointer of length
O(T(n)(T(n) + log n)) = O(T^2(n)). (The tree has 2T(n) levels, and we need space O(T(n)) to
specify the branch at each level.) □
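
The count in the proof can be checked numerically. The following is a minimal sketch, assuming one branch index per level and taking the bounds 2T(n) and n 2^{T(n)} from the lemmas above; the function name pointer_bits and its parameters are ours, not part of the construction.

```python
def pointer_bits(n: int, t: int) -> int:
    """Bits needed to name a leaf: one branch index per level of the encoding tree."""
    levels = 2 * t                      # depth bound from Lemma 6.1.1
    max_children = n * 2 ** t           # bound on the number of children of a node
    # ceil(log2(max_children)) computed exactly with integer arithmetic
    bits_per_branch = (max_children - 1).bit_length()
    return levels * bits_per_branch
```

Whenever T(n) ≥ log n, each branch index needs at most 2T(n) bits, so the total stays within 4T^2(n).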

We describe here an efficient simulation of a PRAM[↑,↓] by a basic PRAM. Let S be
a PRAM[↑,↓] that uses T(n) time and P(n) processors. Let S' be a PRAM[↑,↓] that uses
only short addresses and simulates S according to the Associative Memory Lemma. Thus, S'
uses O(P^2(n)T(n)) processors, O(T(n)) time, and only addresses in 0, 1, ..., O(P(n)T(n)).
Let q be a constant such that each processor in S' uses only q registers.

For numbers generated by S (and therefore S'), the depth of the encoding is at most
2T(n), and every internal node has at most n 2^{T(n)} children. Therefore, the encoding may
have up to 4^{T^2(n)} nodes.

We construct a PRAM Z that simulates S via S' in O(T^2(n)) time, using
O(P^2(n)T(n)4^{T^2(n)}) processors. For ease of description, we allow Z to have q+7 separate
shared memories, mem_0, ..., mem_{q+6}, which can be interleaved. This entails no loss of
generality and only a constant factor time loss (Lemma 3.1).

Initialization. Z partitions mem_0 into O(P(n)T(n)) blocks of 4^{T^2(n)} cells each. This
partitioning allots one block per cell accessed by S', where each block comprises one cell per
node of the encoding tree. Z partitions each of mem_1, ..., mem_q into O(P^2(n)T(n)) blocks.
This allots one block per processor of S' and one memory per local register used by a
processor. (See Figure 6.2.) Let B_i(m) denote the m-th block of mem_i. Throughout the
simulation, B_0(j) contains E(con(j)), and B_i(m), 1 ≤ i ≤ q, contains E(rcon_m(i)) of S'.

Z activates O(P^2(n)T(n)) primary processors, one for each processor used by S'. In
mem_{q+1}, these processors construct an address table. The j-th entry of this table is
j·4^{T^2(n)}, the address of the first cell of the j-th block in every memory. The maximum
address is O(P(n)T(n)4^{T^2(n)}), so this address (and the entire table) is computed in
O(T^2(n)) time.

Each primary processor now deploys 4^{T^2(n)} secondary processors, one for each cell in a
block, in O(T^2(n)) time. To implement a broadcast in constant time, each primary processor
P_m uses c_{q+2}(m) as a communication cell. When the secondary processors are not otherwise
occupied, they concurrently read this cell at each time step, waiting for a signal from the
primary processor to indicate their next tasks.

Consider a complete d-ary tree A with depth 2T(n). We number the nodes of A,
starting with the root as node 1, in the order of a right-to-left breadth-first traversal. Node
number j has children numbered dj-(d-2), ..., dj, dj+1; its parent is numbered
⌊(j+(d-2))/d⌋.
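
The numbering scheme can be checked with a short sketch. The helper names children and parent are ours; the arithmetic is exactly the two formulas stated above.

```python
def children(j: int, d: int) -> range:
    """Node numbers of the d children of node j, rightmost child first."""
    return range(d * j - (d - 2), d * j + 2)


def parent(j: int, d: int) -> int:
    """Node number of the parent of node j, for j >= 2."""
    return (j + (d - 2)) // d
```

For d = 4, the root (node 1) has children 2, 3, 4, 5, and node 6 is the rightmost child of node 2.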


    mem_0                 shared memory
    mem_1, ..., mem_q     local memories
    mem_{q+1}             address table
    mem_{q+2}             communication
    mem_{q+3}             rightmost child
    mem_{q+4}             parent
    mem_{q+5}             rightmost sibling
    mem_{q+6}             two's complement representation of cell contents

Figure 6.2. Shared memories of Z
We view a block as a linear array storing A with d = 4^{T(n)}. Node numbers correspond
to locations in the array. Let node(j) denote the node whose number is j. Let num(α)
denote the node number of node α. Let p(α) denote the parent of node α; let rc(α) denote
the rightmost child of node α. For each primary processor, the j-th secondary processor,
1 ≤ j ≤ 4^{T^2(n)}, handles node(j). Let proc(α) denote the secondary processor assigned to
node α.

Each encoding is a subtree of A because some encoding nodes may have fewer than
4^{T(n)} children. Let lc(α) denote the leftmost nonempty child of node α. When a primary
processor and its secondary processors update E(con(j)) or E(rcon_g(i)), they also update
num(lc(α)) for every node α. Let right(α) denote num(α) - num(rc(p(α))). That is,
right(α) denotes which child α is of p(α), counting from the right. Similarly, let left(α)
denote num(α) - num(lc(p(α))). That is, left(α) denotes which child α is of p(α), counting
from the left.

Using mem_{q+2} for communication with primary processor P_h, corresponding to
processor P_g of S', proc(node(j)), 1 ≤ j ≤ 4^{T^2(n)}, computes num(rc(node(j))) in O(T(n))
time and writes it in c_{q+3}(j). Next, proc(node(j)) determines num(p(node(j))) and writes
this number in mem_{q+4}. It writes j in c_{q+4}(num(rc(node(j)))). Note that j is
num(p(rc(node(j)))). On the k-th cycle, each secondary processor that has
num(p(node(j))) (that is, con_{q+4}(j) ≠ 0) writes that number 2^k cells away (in
c_{q+4}(j+2^k)), if that cell is empty. After Z repeats this procedure T(n)-1 times, con_{q+4}(j)
= num(p(node(j))), for all j. In this procedure, the processors also write
num(rc(p(node(j)))) in mem_{q+5}(j). Then the processor for each node j can compute
right(j).
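
The doubling step can be illustrated sequentially. This is a sketch under our own assumptions: d is a power of two, the empty value is 0, and the loop runs log_2 d cycles, which is what the doubling needs to cover d siblings in this toy setting. The function name fill_parent_table and the parameter internal_nodes are hypothetical.

```python
def fill_parent_table(d: int, internal_nodes: int) -> list[int]:
    """Fill table[c] with the parent number of node c by repeated doubling.

    Cell 0 is unused and the root's cell (index 1) stays 0, meaning empty.
    """
    size = d * internal_nodes + 2                 # cells 0 .. d*internal_nodes + 1
    table = [0] * size
    for j in range(1, internal_nodes + 1):
        table[d * j - (d - 2)] = j                # seed: rightmost child of node j
    k = 0
    while (1 << k) < d:
        snapshot = table[:]                       # every processor reads, then writes
        for c, value in enumerate(snapshot):
            target = c + (1 << k)
            if value != 0 and target < size and snapshot[target] == 0:
                table[target] = value             # copy the number 2^k cells away
        k += 1
    return table
```

After the last cycle every non-root cell holds the number given by the parent formula ⌊(j+(d-2))/d⌋; no write crosses into the next sibling group because cycle k reaches at most 2^{k+1}-1 cells past the seed.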




All  the  addresses  of  cells  accessed  by  S'  can  be  constructed  using  only  addition  and 
subtraction.  In  order  to  quickly  perform  indirect  addressing,  Z  generates  all  cell  and  register 
contents  in  standard  two’s  complement  representation,  except  for  results  of  shifts.  If  the 
value  v  in  a  register  or  a  shared  memory  cell  is  the  result  of  a  shift,  then  S'  will  not  use  v  as 
an  address,  and  S'  will  use  no  other  value  computed  from  v  as  an  address. 

Recall that register addresses are in the range 0, ..., q-1. The two's complement
representation of local register r_g(i) of S', if rcon_g(i) is constructed without shifts, is stored
in c_{q+6}(g(q+1) + i). The two's complement representation of shared memory cell c(j) of
S', if con(j) is constructed without shifts, is stored in c_{q+6}((j+1)(q+1)).

As the final initialization step, Z converts the input to the MIB encoding, writing the
encoding into B_0(0). Z writes the input integer in

Simulation. In a general step of processor P_g of S', P_g executes instruction instr.
Assume for now that instr has the form r(i) ← r(j) ○ r(k). To simulate this step, the
corresponding primary processor P_m of Z and its secondary processors perform four tasks:

Task 1. If ○ is not a shift, then perform ○ on con_{q+6}(g(q+1) + j) and
con_{q+6}(g(q+1) + k), writing the result in con_{q+6}(g(q+1) + i).

Task 2. Merge the first level of the encodings E(rcon_g(j)) and E(rcon_g(k)).

Task 3. Determine where the interesting bits of E(rcon_g(i)) occur in the merged
encodings and compute their marks.

Task 4. Compress these marked interesting bits into the proper structure.

Z uses the procedures MERGE in Task 2 and COMPRESS in Task 4. Depending on the
operation ○, Z may also use the procedures BOOL and ADD in Task 3. These procedures
are described below.
Procedures MERGE, COMPRESS, BOOL, and ADD call procedure COMPARE, which
we now specify. Let j and k be nonnegative integers, and let ψ_1 and ψ_2 be encoding
pointers. If m = λ, the empty string, then COMPARE(j, ψ_1, k, ψ_2, m) compares the value
of subtree E(con(j)).ψ_1 with the value of subtree E(con(k)).ψ_2. COMPARE returns
"equal" if val(E(con(j)).ψ_1) = val(E(con(k)).ψ_2), "greater" if val(E(con(j)).ψ_1) >
val(E(con(k)).ψ_2), or "less" if val(E(con(j)).ψ_1) < val(E(con(k)).ψ_2). Similarly, if m ≠ λ,
then COMPARE compares the value of subtree E(rcon_m(j)).ψ_1 with the value of subtree
E(rcon_m(k)).ψ_2. COMPARE returns "equal" if val(E(rcon_m(j)).ψ_1) = val(E(rcon_m(k)).ψ_2),
"greater" if val(E(rcon_m(j)).ψ_1) > val(E(rcon_m(k)).ψ_2), or "less" if val(E(rcon_m(j)).ψ_1) <
val(E(rcon_m(k)).ψ_2).

Suppose m = λ; the case m ≠ λ is similar. For each node α in the first level of
E(con(j)).ψ_1 simultaneously, proc(α) determines left(α). Then proc(α) computes num(β)
such that node β is in the first level of E(con(k)).ψ_2 and left(β) = left(α) by reading
num(lc(E(con(k)).ψ_2)). Next, proc(α) recursively compares the values of the subtrees
rooted at α and β. If the interesting bit whose location is specified by val(α) is the end of a
constant interval of 1's (0's), then proc(α) writes in the num(α)-th cell in B_{q+2}(g) which
subtree has the greater (lesser) value, but writes nothing if the subtrees have equal value. By
the concurrent write rules, the num(α)-th cell in B_{q+2}(g) either holds the name of the cell (j
or k) whose subtree has greater value or holds nothing if they are equal.

Computing these node numbers takes constant time. COMPARE is recursive in the
depth of the encoding, taking constant time at each level. Consequently,
COMPARE(j, ψ_1, k, ψ_2, m) takes O(T(n)) time.
In Task 2, Z merges the first level of the encodings E(rcon_g(j)) and E(rcon_g(k)). Z does
this to compare the positions of interesting bits in rcon_g(j) and rcon_g(k). This comparison is
necessary to determine the positions of the interesting bits in rcon_g(i).

The  subtrees  rooted  at  the  first  level  of  E (d)  form  a  list  sorted  in  increasing  order  by 
their  values.  The  beginning  of  the  list  corresponds  to  the  rightmost  child  of  the  root. 

MERGE(j, k, i) returns, in B_i(g), the list resulting from merging the first levels of
E(rcon_g(j)) and E(rcon_g(k)). The list contains up to O(2^{T(n)}) subtrees, each of which is a
subtree of E(rcon_g(j)) or E(rcon_g(k)). Each subtree in the merged list retains indications of
whether it is from j or k, whether it is the end of a constant interval of 0's or 1's, and its
(singleton) mark.

In parallel, Z compares val(α), for each subtree rooted at α in the first level of
E(rcon_g(j)), with val(β), for each subtree rooted at β in the first level of E(rcon_g(k)). If
val(α) > val(β), then proc(α) writes right(β) into the num(α)-th cell of B_{q+2}(g). By the
concurrent write rules, this cell contains the largest right(β) for which val(α) > val(β). Call
this value max(β). For each α, right(α) + max(β) specifies the position of the subtree rooted
at α in the merged list. Next, for each β, if val(β) > val(α), then proc(β) writes right(α)
into the num(β)-th cell of B_{q+2}(g). Call this value max(α). For each β, right(β) + max(α)
specifies the position of the subtree rooted at β in the merged list. A comparison takes
O(T(n)) time, so Z performs a MERGE in O(T(n)) time. (Note: Each subtree in the
merged list also indicates whether its value is equal to that of the succeeding subtree in the
list.)
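
The idea behind MERGE, placing each element at its own index plus the number of smaller elements in the other list, can be sketched sequentially. The sketch works on plain integers rather than subtrees, and its tie-breaking rule (elements of the first list precede equal elements of the second) is our assumption; the name merge_by_rank is hypothetical.

```python
def merge_by_rank(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists by computing every element's final position directly."""
    merged = [None] * (len(a) + len(b))
    for i, x in enumerate(a):                         # one "processor" per element of a
        smaller_in_b = sum(1 for y in b if y < x)     # elements of b strictly below x
        merged[i + smaller_in_b] = x
    for i, y in enumerate(b):                         # one "processor" per element of b
        not_larger_in_a = sum(1 for x in a if x <= y) # ties: a's element goes first
        merged[i + not_larger_in_a] = y
    return merged
```

Each position is computed independently of the others, which is what lets all comparisons run in parallel in Z.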

We introduce one more procedure before describing the computation of the interesting
bits of rcon_g(i). Let I(d) denote the MIB encoding of d without the marks.




PLUS_ONE(k, ψ_1, i, ψ_2) writes I(val(E(rcon_g(k)).ψ_1) + 1) in the location set aside for
subtree ψ_2 in B_i(g). That is, given E(d), for d an integer, PLUS_ONE writes I(d+1).
PLUS_ONE does not write singleton marks. Z uses PLUS_ONE to generate I(d+1) to test
for equality with E(x), x an integer. The processors ignore marks to interpret E(x) as I(x).

At most, the two rightmost interesting bits of d+1 will be different from those of d. We have
four cases to consider:

(a)  d  starts  with  a  0,  the  0  is  a  singleton, 

(b)  d  starts  with  a  0,  the  0  is  not  a  singleton, 

(c)  d  starts  with  a  1,  the  first  0  is  a  singleton,  and 

(d)  d  starts  with  a  1,  the  first  0  is  not  a  singleton. 

In every case, Z complements the start bit. In case (a), Z deletes the first interesting bit in
E(d). In case (b), Z adds a new interesting bit at location 0. In case (c), Z deletes the second
interesting bit. In case (d), let us suppose the first interesting bit is at location l. Z deletes
this interesting bit and creates a new interesting bit at location l+1 by recursively calling
PLUS_ONE. (Naturally, when Z adds or deletes one interesting bit, it shifts the subtrees
encoding the locations of the other interesting bits one position left or right, as necessary. In
a block, the associated processors copy their respective subtrees in constant time.)

PLUS_ONE is recursive with depth T(n), the depth of the encodings. PLUS_ONE uses
constant time at each level, so O(T(n)) time overall.

We now are ready to describe how Z accomplishes Task 3. Assume without loss of
generality that i, j, and k are different. Z's actions in Task 3 depend on the operation ○ in
instr. Define an interval-pair to be the intersection of a constant interval in rcon_g(j) and a
constant interval in rcon_g(k). For example, three interval-pairs, denoted a, b, and c, are
shown below.

                 c c b b b a
    rcon_g(j)    1 1 0 0 0 1
    rcon_g(k)    0 0 1 1 1 1

The interval-pair length of interval-pair a is 1, of interval-pair b is 3, and of interval-pair c is
2.
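
The example can be reproduced with a short sketch that works on explicit bit strings rather than on MIB encodings. The function name interval_pairs and the tuple layout are ours.

```python
def interval_pairs(x: str, y: str) -> list[tuple[int, int, str, str]]:
    """Split two equal-length bit strings (most significant bit first) into interval-pairs.

    Returns one (start, length, x_bit, y_bit) tuple per interval-pair,
    least significant interval-pair first.
    """
    assert len(x) == len(y)
    xs, ys = x[::-1], y[::-1]           # index 0 is now the least significant bit
    pairs = []
    start = 0
    for pos in range(1, len(xs) + 1):
        at_end = pos == len(xs)
        # A new interval-pair begins as soon as either operand changes value.
        if at_end or xs[pos] != xs[start] or ys[pos] != ys[start]:
            pairs.append((start, pos - start, xs[start], ys[start]))
            start = pos
    return pairs
```

Calling interval_pairs("110001", "001111") yields three interval-pairs of lengths 1, 3, and 2, matching a, b, and c.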

ZERO_ONE(j, k, i) takes as input the merged list from E(rcon_g(j)) and E(rcon_g(k)) in
B_i(g) and returns as output an indication for each subtree in the list from E(rcon_g(j))
(respectively, E(rcon_g(k))) whether rcon_g(k) (respectively, rcon_g(j)) is in a constant interval
of 0's or 1's at the location specified by the value of the subtree. The secondary processors
handling the merged list act as a binary computation tree to pass along the desired
information in O(T(n)) time.

IP_LENGTH(j, k, i) takes as input the merged list from E(rcon_g(j)) and E(rcon_g(k))
in B_i(g) and returns as output an indication for each subtree in the list whether the
interval-pair ending at the location specified by the value of the subtree has length 1 or
greater than 1. To perform this computation, Z calls PLUS_ONE(i, ψ_l, i', ψ_l) in parallel for
each subtree in the list, where ψ_l is the location of the subtree in the list and i' refers to
B_{q+2}(i). Suppose the subtree encodes the integer d. Then, in parallel, Z tests I(d+1) for
equality with the succeeding subtree in the list. If they have equal value, then the
interval-pair length is 1; otherwise, the interval-pair length is greater than 1.

If instr is r(i) ← r(j) ∨ r(k), then Z calls BOOL(j, k, i, ∨). BOOL(j, k, i, ∨) writes
E(rcon_g(j) ∨ rcon_g(k)) in B_i(g). Assume that we have the merged encodings of E(rcon_g(j))
and E(rcon_g(k)) in B_i(g). Z must compute the interesting bits and their marks. Z performs
two preliminary steps:

(1) for each subtree in the list from E(rcon_g(j)) (respectively, E(rcon_g(k)))
determine whether rcon_g(k) (respectively, rcon_g(j)) is in a constant interval of 0's
or 1's at the location specified by the value of the subtree and

(2) determine whether the interval-pair length is 1.

Z calls ZERO_ONE(j, k, i) and IP_LENGTH(j, k, i) to perform (1) and (2), respectively.

A subtree is interesting if its value is the location of an interesting bit of rcon_g(i); otherwise,
the subtree is boring. Following the rules in Appendix A, the processors associated with
each subtree tag it as "interesting" or "boring." It remains for Z to compute the marks of
the interesting bits. For each interesting subtree, the processor associated with the root
determines the following. If the interval-pair has length 1 and the preceding subtree in the
list is a nonblank, then mark it s; otherwise, mark it m. The entire procedure takes time
O(T(n)). Other Boolean instructions are handled similarly.

If instr is r(i) ← r(j) + r(k), then Z calls ADD(j, λ, k, λ, i, λ).

ADD(j, ψ_1, k, ψ_2, i, ψ_3) writes E(val(E(rcon_g(j)).ψ_1) + val(E(rcon_g(k)).ψ_2)) in the
location set aside for subtree ψ_3 in B_i(g). Again, assume that we have the list of merged first
level subtrees of E(rcon_g(j)) and E(rcon_g(k)) in B_i(g). To accomplish Task 3, Z must test
four conditions at the bit location specified by the value of the subtree:

(a) whether the rcon_g(j) and rcon_g(k) pairs are both in constant intervals of 0's,
both in 1's, or one in 0's and one in 1's,

(b) whether there is a carry-in to the interval-pair,

(c) whether rcon_g(i) is in a constant interval of 0's or 1's prior to the start of the
interval-pair, and

(d) whether the interval-pair length is 1 or greater than 1.

Z calls ZERO_ONE and IP_LENGTH to test conditions (a) and (d) in time O(T(n)). For
each subtree α in the list, proc(α) does the following. To test condition (b), proc(α) tests
condition (a) at the preceding subtree in the list. If both rcon_g(j) and rcon_g(k) are in 0's,
then there is no carry-in; if both are in 1's, then there is a carry-in. If one is in 0's and the
other is in 1's, then the carry-in depends on the carry-in to the preceding interval-pair. To
propagate this information in time O(T(n)), processors again act as a binary computation
tree. To test condition (c), proc(α) determines whether rcon_g(i) is in a constant interval of
0's or 1's at position val(α) using the other three conditions, then passes this information to
proc(num(α)+1). The processors act as a binary computation tree of height O(T(n)) in
testing all four conditions. Thus, all four conditions can be tested in O(T(n)) time.
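
The carry rule for condition (b) can be sketched as follows. We label each interval-pair 'k' (both operands in 0's, no carry out), 'g' (both in 1's, carry out), or 'p' (one in 0's and one in 1's, carry passes through); these labels and the names combine and carry_ins are ours. The sketch scans sequentially, but combine is associative, which is the property that allows the same computation to be arranged as a binary tree.

```python
from itertools import accumulate


def combine(earlier: str, later: str) -> str:
    """Compose carry behaviours: a later 'p' defers to whatever came before it."""
    return earlier if later == "p" else later


def carry_ins(statuses: list[str]) -> list[int]:
    """Carry into each interval-pair, least significant first, with no initial carry."""
    prefixes = list(accumulate(statuses, combine))   # behaviour of pairs 0..t combined
    carries = [0]                                     # nothing enters the first pair
    # A combined prefix carries out only if it resolves to 'g'.
    carries += [1 if s == "g" else 0 for s in prefixes[:-1]]
    return carries
```

For the interval-pairs a, b, c of the earlier example (statuses 'g', 'p', 'p'), the carries in are 0, 1, 1, which agrees with 110001 + 001111 = 1000000.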

Following the rules in Appendix B, for each subtree α in the list, proc(α) determines
whether subtree α encodes the position of an interesting bit of rcon_g(i). In doing so, proc(α)
may call the procedure PLUS_ONE. This leads to an overall time of O(T(n)) to perform an
addition.

If instr is r(i) ← r(j) - r(k), then Z computes E(rcon_g(j)+1) by a call to ADD,
then calls ADD(j, λ, k', λ, i, λ), where k' indicates that the start bit of E(rcon_g(k)) is
complemented (thus adding rcon_g(j) and the two's complement of rcon_g(k)). This takes
O(T(n)) time.

If rcon_g(k) < 0 and instr is r(i) ← r(j) ↑ r(k), then Z treats instr as r(i) ← r(j) ↓ r(k),
substituting |rcon_g(k)| for rcon_g(k). Similarly, if rcon_g(k) < 0 and instr is
r(i) ← r(j) ↓ r(k), then Z treats instr as r(i) ← r(j) ↑ r(k), substituting |rcon_g(k)| for
rcon_g(k). Thus, for both shift instructions, we shall assume rcon_g(k) ≥ 0.
If instr is r(i) ← r(j) ↑ r(k), then the d-th interesting bit of rcon_g(i) is in the position
specified by the sum of rcon_g(k) and the position of the d-th (if the least significant bit of
rcon_g(j) is 0) or (d+1)-st (if the least significant bit of rcon_g(j) is 1 and rcon_g(k) ≠ 0)
interesting bit of rcon_g(j). Z adds rcon_g(k) to the value of each subtree from rcon_g(j).
Marks stay the same, except perhaps for the first interesting bit of rcon_g(j): if it has mark s
and rcon_g(k) = 0, then Z marks it s; otherwise, Z marks it m. This procedure takes O(T(n))
time, the time to perform ADD.

If instr is r(i) ← r(j) ↓ r(k), then Z subtracts rcon_g(k) from the value of each first level
subtree of rcon_g(j). Z tags as boring those subtrees for which this difference is negative. For
the others, this difference is the location of an interesting bit in rcon_g(i). Let γ denote the
subtree whose value specifies the location of the first interesting bit in rcon_g(i). Marks stay
the same, except perhaps for γ: if val(γ) = 0, then Z marks it s; otherwise, Z marks it m. The
start bit of rcon_g(i) depends on whether rcon_g(j) is in a constant interval of 0's or 1's at the
location specified by the subtree that became γ. This procedure takes time O(T(n)), the time
to perform ADD.

Z accomplishes Task 4 by calling COMPRESS. COMPRESS(i) takes the contents of
block i, which implicitly stores a tree in which some subtrees rooted at the first level are
tagged to be deleted (boring), and rewrites the tree without the boring subtrees. The
secondary processors act as a binary computation tree so that the processors associated with
the root of each interesting (that is, not boring) first level subtree can determine the number
of interesting subtrees to the right in time O(T(n)). This number specifies the location of the
subtree in the compressed tree. Then Z copies each subtree into the appropriate location and
writes zeroes in the unused locations. Overall, COMPRESS(i) takes O(T(n)) time.
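
The compaction step can be sketched on a flat list, with index 0 standing for the rightmost first level subtree. The list representation and the name compress are our simplifications; the prefix count plays the role of "the number of interesting subtrees to the right."

```python
from itertools import accumulate


def compress(block: list, interesting: list[bool]) -> list:
    """Pack the interesting entries toward index 0 and write zeroes in the unused slots."""
    # Exclusive prefix count: number of interesting entries to the right of each slot.
    counts = [0] + list(accumulate(int(flag) for flag in interesting))[:-1]
    packed = [0] * len(block)
    for slot, (subtree, flag) in enumerate(zip(block, interesting)):
        if flag:
            packed[counts[slot]] = subtree       # destination depends only on the count
    return packed
```

For example, compress(["a", "b", "c", "d", "e"], [True, False, True, True, False]) gives ["a", "c", "d", 0, 0].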
Now let us consider instructions instr executed by processor P_g of S' where instr has a
form other than r(i) ← r(j) ○ r(k).

If instr is r(i) ← c(r(j)), then P_m reads y = con_{q+6}(g(q+1) + j), the two's complement
representation of rcon_g(j). P_m and its secondary processors then copy B_0(y) into B_i(g). P_m
also writes con_{q+6}((y+1)(q+1)) in c_{q+6}(g(q+1) + i). We handle instruction instr of the
form r(i) ← r(r(j)) similarly.

If instr is c(r(i)) ← r(j), then P_m reads y = con_{q+6}(g(q+1) + i), the two's complement
representation of rcon_g(i). P_m and its secondary processors then copy B_j(g) into B_0(y). P_m
also writes con_{q+6}(g(q+1) + j) in c_{q+6}((y+1)(q+1)). We handle instruction instr of the
form r(r(i)) ← r(j) similarly.

If processors P_f and P_g of S' wish to simultaneously write c(j), then the corresponding
processors P_l and P_m of Z will simultaneously attempt to write B_0(j). If f < g, then l < m,
and all secondary processors of P_l are numbered less than all secondary processors of P_m.
Thus, in S', P_f succeeds in its write, and in Z, P_l and its secondary processors succeed in
their writes.
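
The write rule used in this argument, under which the lowest-numbered processor succeeds, can be sketched as follows. The request format and the name resolve_concurrent_writes are hypothetical.

```python
def resolve_concurrent_writes(requests: list[tuple[int, int, object]]) -> dict[int, object]:
    """Resolve one step of writes; each request is (processor_number, cell, value).

    When several processors write the same cell, the lowest-numbered one succeeds.
    """
    winners: dict[int, tuple[int, object]] = {}
    for processor, cell, value in requests:
        if cell not in winners or processor < winners[cell][0]:
            winners[cell] = (processor, value)
    return {cell: value for cell, (_, value) in winners.items()}
```

Because every secondary processor of P_l is numbered below every secondary processor of P_m, applying this rule cell by cell within a block leaves the whole block written by P_l's group.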

At  most  processors  simultaneously  read  or  write  a  cell  in  the  simulation 
described above. Note that simultaneous reads and writes occur in two ways: (a) O(4^{T^2(n)})
processors simultaneously read or write a cell at one step of an O(T(n)) step procedure, and
(b) O(2^{T(n)}) processors simultaneously read or write at each step of an O(T(n)) step
procedure. As a result, if we wish to restrict the number of simultaneous reads and writes,
we can revise the simulation with no time loss such that all simultaneous reads and writes of
form (a) are modified to form (b).
Theorem 6.1. For all T(n) ≥ log n, PRAM[↑,↓]-TIME(T(n)) ⊆ PRAM-TIME(T^2(n)).

Proof. In the simulation given above, Z takes O(T(n)) time per step of S' to merge two
encodings, compute new marked interesting bits, and compress the list into the proper MIB
form. S' simulates S in O(T(n)) time. Hence, Z takes O(T^2(n)) time to simulate S via S'.
□

Corollary 6.1.1. PRAM[↑,↓]-POLYLOGTIME = PRAM-POLYLOGTIME.

6.2. Simulations of PRAM[↑,↓] by Circuits and Turing Machine

We now describe simulations of a PRAM[↑,↓] by a log-space uniform family of
unbounded fan-in circuits, a log-space uniform family of bounded fan-in circuits, and a
Turing machine.

Lemma 6.2.1. For each n, every language recognized by a PRAM[↑,↓] S in time T(n) with
P(n) processors can be recognized by a log-space uniform unbounded fan-in circuit C_n of
depth O(T^2(n)) and size O(P^4(n)T^4(n)16^{T^2(n)}).

Proof. Let Z be the PRAM described in Theorem 6.1, simulating S in O(T^2(n)) time with
O(P^2(n)T(n)4^{T^2(n)}) processors. Fix an input length n. We construct an unbounded fan-in
circuit C_n that simulates Z by the construction given by Theorem 2.1. □

Lemma 6.2.2. For each n, every language recognized by a PRAM[↑,↓] S in time T(n) with
P(n) processors can be recognized by a log-space uniform unbounded fan-in circuit UC_n of
depth O(T^2(n)), size O(P^4(n)T^4(n)16^{T^2(n)}), and maximum fan-in O(4^{T(n)}T^2(n)).

Proof. Fix an input length n. We construct UC_n from C_n of Lemma 6.2.1. We reduce the
fan-in in the portions of the circuit that simulate updates in the shared memory of Z. The
circuit described in Theorem 2.1 allows all processors to attempt to simultaneously write the
same cell. This does not occur in Z. During the execution of each procedure of Z, either
4^{T^2(n)} secondary processors concurrently write the same cell once or 4^{T(n)} secondary
processors concurrently write the same cell at each of O(T(n)) levels of recursion. Thus, we
can modify Z such that at most 4^{T(n)} processors attempt to write the same cell at each time
step, keeping the time for each procedure at O(T(n)). By the construction given in Theorem
2.1, this leads to a maximum fan-in for any gate in UC_n of O(4^{T(n)}T^2(n)) if T(n) > n or
O(4^{T(n)}T(n)n) if T(n) ≤ n. The circuit remains uniform after modifications to Z because
the processors concurrently writing are all secondary processors belonging to the same
primary processor. UC_n has depth O(T^2(n)) and size
O(P^2(n)T^4(n)4^{T^2(n)}(T(n) + P^2(n)4^{T^2(n)})) = O(P^4(n)T^4(n)16^{T^2(n)}). □

Lemma 6.2.3. For each n, every language recognized by a PRAM[↑,↓] S in time T(n) with
P(n) processors can be recognized by a log-space uniform bounded fan-in circuit BC_n of
depth O(T³(n)) and size O(P⁴(n)T⁴(n)16^{T²(n)}).

Proof. Fix an input length n. Let UC_n be the unbounded fan-in circuit described in Lemma
6.2.2 that simulates S. The gates of UC_n have maximum fan-in of O(2^{T(n)}T²(n)) if
T(n) ≥ n or O(2^{T(n)}T(n)n) if T(n) < n. We construct the bounded fan-in circuit BC_n by
replacing each gate of UC_n with fan-in f by a tree of gates of depth log f. Since every
f = O(2^{T(n)}(T²(n) + nT(n))), and T(n) ≥ log n, BC_n can simulate each gate of UC_n in depth
O(T(n)). Since UC_n has depth O(T²(n)) by Lemma 6.2.2, BC_n has depth O(T³(n)). □
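The fan-in reduction in the proof of Lemma 6.2.3 can be illustrated numerically. The following is a minimal sketch, assuming the tree that replaces a gate is a balanced tree of two-input gates; the function name `tree_depth` is hypothetical and does not appear in the thesis.

```python
def tree_depth(fan_in):
    """Depth of a balanced tree of two-input gates that combines fan_in inputs.

    Each level halves (rounding up) the number of signals still to be
    combined, so the result equals ceil(log2(fan_in)) for fan_in >= 1.
    """
    depth = 0
    width = fan_in
    while width > 1:
        width = (width + 1) // 2
        depth += 1
    return depth


# Example: a gate with fan-in 2^T * T^2 for T = 10 needs depth about T + 2 log T.
T = 10
print(tree_depth(2**T * T**2))
```

For a fan-in of the form 2^{T(n)} times a polynomial in T(n) and n, the depth is T(n) plus a logarithmic term, which is O(T(n)) when T(n) ≥ log n.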

Theorem 6.2. For all T(n) ≥ log n, PRAM[↑,↓]-TIME(T(n)) ⊆ DSPACE(T³(n)).

Proof. Theorem 6.2 follows from Lemma 6.2.3 and Borodin's (1977) result that a bounded
fan-in circuit of depth D(n) can be simulated in space O(D(n)) on a Turing machine. □

Theorem 6.2 and a fundamental result of Fortune and Wyllie (1978),

DSPACE(T(n)) ⊆ PRAM-TIME(T(n)) for all T(n) ≥ log n,

together imply that PRAM[↑,↓]-TIME(T(n)) ⊆ PRAM-TIME(T³(n)). The direct
simulation of Theorem 6.1 is more efficient.

Theorem 6.1 and the other fundamental result of Fortune and Wyllie,

PRAM-TIME(T(n)) ⊆ DSPACE(T²(n)) for all T(n) ≥ log n,

together imply that PRAM[↑,↓]-TIME(T(n)) ⊆ DSPACE(T⁴(n)). The O(T³(n)) space
simulation of Theorem 6.2 is more efficient.

Corollary 6.2.1. PRAM[↑,↓]-PTIME = PSPACE.

6.3. Direct Simulation of PRAM[↑,↓] by Turing Machine

In the previous section, we indirectly simulated a PRAM[↑,↓] by a Turing machine.
Here, we present a direct simulation that achieves the same space bound.

We use the interesting bit (IB) encoding of Simon (1977). Let d be an integer, and let
w = len(d). Let b_{w-1} ··· b_0 be the w-bit two's complement representation of d. We define the
interesting bit encoding as

I(0) = 0,

I(01) = 1,

I(d) = (I(a_k), ..., I(a_2), I(a_1); r),

where d is an integer, a_j is the position of the jth interesting bit of d, and r is the value (0 or
1) of the rightmost bit of d.

For example, I(01100) = (I(011), I(01); 0) = ((I(01); 1), 1; 0) = ((1; 1), 1; 0). For all d, define
val(I(d)) = d and val(Ī(d)) = d̄; that is, applied to a complemented encoding, val returns the
complement of the two's complement representation of d. Observe that the IB encoding is
simply the MIB encoding without the marks.
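The recursive structure of the IB encoding can be sketched in a few lines. This is an illustration only, under two assumptions that are not stated in this section: a bit b_i is taken to be interesting when it differs from its left neighbor b_{i+1} (which reproduces the example I(01100) above), and only nonnegative integers are handled. The names `twos_complement_bits` and `ib_encode` are hypothetical.

```python
def twos_complement_bits(d):
    """Shortest two's complement bit string of a nonnegative integer d."""
    if d < 0:
        raise ValueError("this sketch handles nonnegative integers only")
    return "0" if d == 0 else "0" + bin(d)[2:]


def ib_encode(d):
    """Return the IB encoding of d as a string such as '((1; 1), 1; 0)'."""
    if d == 0:
        return "0"   # base case I(0) = 0
    if d == 1:
        return "1"   # base case I(01) = 1
    bits = twos_complement_bits(d)
    b = [int(c) for c in reversed(bits)]  # b[i] is bit b_i, b[0] rightmost
    # Assumed definition: b_i is interesting if it differs from b_{i+1}.
    positions = [i for i in range(len(b) - 1) if b[i] != b[i + 1]]
    # Positions are listed from the leftmost interesting bit to the rightmost.
    parts = [ib_encode(a) for a in reversed(positions)]
    return "(" + ", ".join(parts) + "; " + str(b[0]) + ")"


print(ib_encode(0b01100))  # ((1; 1), 1; 0)
```

Because every position is at most the length of the bit string, the nesting depth of the encoding is small compared with the value encoded.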

We simulate a PRAM[↑,↓] S running in time T(n) by a TM running in space
polynomial in T(n) by writing only pointers into the encodings of cell contents. This
manipulation of pointers to individual symbols in the encoding is similar to the manipulation
of individual bits in the simulation of a PRAM[*] by a TM in Section 4.3. During the
computation of S on any input of length n, for every c(j), the length of I(con(j)) is at most
exponential in T²(n), and pointers have length at most T²(n) (Lemma 6.1.3). We invoke the
Associative Memory Lemma to construct a PRAM[↑,↓] S' that simulates S using only short
addresses.

Simulation. We now describe the simulation of a time-bounded PRAM[↑,↓] by a
space-bounded TM. Let S be a PRAM[↑,↓] that uses T(n) time and P(n) processors. Let S'
be a PRAM[↑,↓] that uses only short addresses and simulates S according to the Associative
Memory Lemma. Thus, S' uses O(P²(n)T(n)) processors, O(T(n)) time, and only
addresses in 0, 1, ..., O(P(n)T(n)).

We construct a TM M that simulates S via S' in T³(n) space. M uses four mutually
recursive procedures: PCOUNTER, COMPARE, SYMBOL, and ADD. In the following, α
(and β) may have any of the following forms: (i) #d, where d is a constant, (ii) j, where j is a
register address and I(α') is I(rcon_m(j)), (iii) j.φ, where φ is a pointer and I(α') is
I(rcon_m(j)).φ, (iv) 1 + j.φ, where I(α') is I(1 + val(I(rcon_m(j)).φ)), or (v) 1 + j̄.φ, where I(α')
is I(1 + val(Ī(rcon_m(j)).φ)). During the simulation, every #d, j, and j.φ parameter can be
written in O(T²(n)) space.

PCOUNTER(m, t) returns the contents of the program counter of P_m at time t. To
determine whether S' accepts input ω, M executes PCOUNTER to check whether P_0 halts
with its program counter on an ACCEPT instruction by time O(T(n)). Let p be the value
returned by PCOUNTER(m, t). PCOUNTER(m, t) depends on r, the value returned by
PCOUNTER(m, t-1). If r indicates that P_m was not active at time t-1, then PCOUNTER
determines whether P_⌊m/2⌋ activated P_m at time t-1 with a FORK instruction by calling
PCOUNTER(⌊m/2⌋, t-1). If P_⌊m/2⌋ executed FORK label1, label2, then if m is even,
p = label1; otherwise, p = label2. If P_⌊m/2⌋ did not execute FORK, then P_m is inactive at
time t. If r indicates that P_m is active at time t-1 and step r is not a CJUMP, REJECT, or
ACCEPT instruction, then p = r+1. If step r is CJUMP r(i) comp r(j), label3, where comp
is an integer comparison, then PCOUNTER repeatedly calls VALUE for time t-1 to compare
rcon_m(i) and rcon_m(j). If the comparison is true, then p = label3; otherwise, p = r+1. If
instruction r is an ACCEPT (REJECT), then p = r.

In the following procedures, if m = λ, then interpret α and β as referring to shared
memory cells; if m ≠ λ, then interpret α and β as referring to registers of P_m. We describe
the case m ≠ λ.

COMPARE(α, ψ₁, β, ψ₂, m, t) compares the value of subtree I(α').ψ₁ at time t (of the
computation of S') with the value of subtree I(β').ψ₂ at time t. COMPARE returns "equal" if
val(I(α').ψ₁) = val(I(β').ψ₂), "greater" if val(I(α').ψ₁) > val(I(β').ψ₂), or "less" if
val(I(α').ψ₁) < val(I(β').ψ₂). COMPARE recursively compares the subtrees of I(α').ψ₁ and
I(β').ψ₂ from right to left. COMPARE calls SYMBOL to determine whether the elements
considered are subtrees or leaves, and, if they are leaves, to determine the symbol at the leaf.
The interested reader is referred to Appendix C for the details of COMPARE.

SYMBOL(α, ψ, m, t) returns the symbol I(α').ψ if ψ points to a leaf of I(α');
otherwise, SYMBOL returns a signal that ψ points to a subtree of I(α').

Assume α has the form i. SYMBOL calls PCOUNTER to determine whether P_m wrote
register r_m(i) at time t-1. If P_m did not write r_m(i), then SYMBOL returns
SYMBOL(i, ψ, m, t-1). Otherwise, suppose P_m executed an instruction instr that wrote
r_m(i) at time t-1.

If instr was r(i) ← d, then this is a base case. If ψ points to a leaf of the encoding of the
constant d, then SYMBOL returns I(d).ψ; otherwise, SYMBOL returns a signal that ψ points
to a subtree.

If instr was r(i) ← r(j), then SYMBOL returns SYMBOL(j, ψ, m, t-1). If t = 0, then
SYMBOL(i, ψ, m, t) is a base case. M can determine I(rcon_m(i)).ψ because rcon_m(0) at
time 0 is m, all other registers contain 0, and M has space to write the processor number in
encoded form. (Note: If m = λ, then t = 0 is again a base case. At time 0, con(0) is the
input, all other cells contain 0, and M has space to write the input in encoded form.)

For  Boolean  operations,  SYMBOL  is  straightforward.  See  Appendix  D  for  details. 

If instr was r(i) ← r(j) + r(k), then SYMBOL returns ADD(j, k, ψ, m, t-1), which
returns I(rcon_m(j) + rcon_m(k)).ψ.

If instr was r(i) ← r(j) - r(k), then SYMBOL returns ADD(j, 1+k̄, ψ, m, t-1). Recall
that 1 + k̄, where k̄ denotes the bitwise complement of rcon_m(k), is the two's complement of rcon_m(k).

If rcon_m(k) < 0 and instr was r(i) ← r(j) ↑ r(k), then M treats instr as
r(i) ← r(j) ↓ r(k), substituting |rcon_m(k)| for rcon_m(k). Similarly, if rcon_m(k) < 0 and instr
was r(i) ← r(j) ↓ r(k), then M treats instr as r(i) ← r(j) ↑ r(k), substituting |rcon_m(k)| for
rcon_m(k). Thus, for both shift instructions, we shall assume rcon_m(k) ≥ 0.

If instr was r(i) ← r(j) ↑ r(k), then the vth interesting bit of rcon_m(i) will be in the
position specified by the sum of rcon_m(k) and the position of the vth (if the rightmost bit of
rcon_m(j) is 0) or (v+1)st (if the rightmost bit of rcon_m(j) is 1) interesting bit of rcon_m(j). For
ψ = x_0.x_1. ···, let FIRST(ψ) = x_0 and REST(ψ) = x_1.x_2. ···. SYMBOL returns either
ADD(j.FIRST(ψ), k, REST(ψ), m, t-1) or ADD(j.(FIRST(ψ)-1), k, REST(ψ), m, t-1).
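The sign normalization for shift instructions can be sketched on ordinary integers. This is a minimal illustration, assuming Python's arbitrary-precision integers as a stand-in for PRAM cell contents (Python's `>>` on negative integers behaves as an arithmetic shift on the infinite two's complement representation); the function names are hypothetical.

```python
def shift_up(x, k):
    """Left shift x by k; a negative k is treated as a right shift by |k|."""
    return x << k if k >= 0 else x >> (-k)


def shift_down(x, k):
    """Right shift x by k; a negative k is treated as a left shift by |k|."""
    return x >> k if k >= 0 else x << (-k)


print(shift_up(3, 2), shift_up(5, -1), shift_down(5, -2))
```

After this normalization, every shift that remains to be analyzed has a nonnegative shift distance, which is the case the text goes on to treat.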

If instr was r(i) ← r(j) ↓ r(k), then if rcon_m(k) is greater than the position of the
leftmost interesting bit of rcon_m(j), then SYMBOL returns 0; otherwise, the first interesting bit
of rcon_m(i) will be the first interesting bit of rcon_m(j) such that the difference between the
value of its position and rcon_m(k) is nonnegative. The interesting bit of rcon_m(i) will be in
the position specified by the difference between the value of the position of this interesting
bit of rcon_m(j) and rcon_m(k). SYMBOL calls ADD to find this information.

If instr used an indirect address, say c(r(j)), then M uses SYMBOL to get I(rcon_m(j))
one symbol at a time. Next, M decodes I(rcon_m(j)) to get rcon_m(j). Since all addresses used
by S' are O(T(n)) long, M has space to write rcon_m(j). Now SYMBOL can directly access
the desired cell. If the indirect address was r(r(g)), then M reads rcon_m(g) one symbol at a
time. Since each processor in S' uses only a constant number of registers, len(rcon_m(g)) is a
constant, and M has space to write the address.

ADD(α, β, ψ, m, t) returns the symbol to which ψ points in I(val(I(α')) + val(I(β'))) if
ψ points to a leaf in the encoding of the sum; otherwise, ADD returns a signal that ψ points
to a subtree of the encoding of the sum. ADD computes subtrees of I(val(I(α')) + val(I(β')))
from right to left, based on cases depending on whether val(I(α')) and val(I(β')) are in
constant intervals of 0's or 1's and on the carry from the previous bit position. Note that this
procedure also works when β is of the form 1 + k̄. See Appendix E for details.

Theorem 6.3. For all T(n) ≥ log n, PRAM[↑,↓]-TIME(T(n)) ⊆ DSPACE(T³(n)).

Proof. The simulation given above simulates a PRAM[↑,↓] S by a TM M.

For each invocation of PCOUNTER, COMPARE, SYMBOL, and ADD, M can write all
variables and parameters in space O(log P(n)T(n)), O(T²(n)), O(T²(n)), and O(T²(n)),
respectively. The depth of recursion of these procedures is at most O(T(n)). Since
P(n) ≤ 2^{T(n)}, M uses space O(T³(n)) to simulate S. With linear space compression, M uses
space T³(n). □

Chapter 7. Multiplication and Shift

We  study  the  interaction  of  multiplication  and  shift  instructions  in  this  chapter. 
Combined,  they  can  produce  very  long  and  complex  numbers.  The  product  of  two  integers 
with b and b' interesting bits can have bb' interesting bits. Thus with s interleaved shift and
multiplication operations, a PRAM[*,↑] can build numbers with  interesting bits.

Let NEXPTIME be the class of languages accepted by a nondeterministic Turing
machine in time O(c^{poly(n)}), where c is a constant and poly(n) is a polynomial in n; let
EXPSPACE be the class of languages accepted by a Turing machine in space O(c^{poly(n)}), c
a constant. In Section 7.1, we prove NEXPTIME ⊆ PRAM[*,↑]-PTIME, and in Section 7.2,
we prove PRAM[*,↑,↓]-PTIME ⊆ EXPSPACE. We have previously shown that
polynomial time on a PRAM[*] or a PRAM[↑,↓] is equal to PSPACE on a Turing machine.
Thus, a PRAM with both multiplication and shift instructions may be more powerful, to
within a polynomial in time, than a PRAM with either multiplication or shift alone, since it is
believed that NEXPTIME properly contains PSPACE.

7.1. Simulation of TM by PRAM[*,↑]

We present here a simulation of a nondeterministic Turing machine (NTM) running in
exponential time by a PRAM[*,↑] running in polynomial time. Our strengthening of the
simulation of a Turing machine from PSPACE to NEXPTIME relies on an interaction
between multiplication and shift operations. For an integer v, let #v denote its two's
complement representation; the number of bits in the two's complement representation will
be clear from the context. As previously noted, the shift operation is useful for making
copies of strings in a single cell. Multiplication can make copies much more quickly. If v is
an integer such that 1's are spaced widely enough in #v, then multiplying an integer u by an
integer v produces an integer w such that #w contains a copy of #u for every 1 in #v.
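The copying effect of multiplication is easy to check on small integers. The sketch below assumes nonnegative u that fits in `spacing` bits, so that copies do not overlap; the helper names `copies_by_multiplication` and `slots` are hypothetical.

```python
def copies_by_multiplication(u, spacing, count):
    """Multiply u by an integer v whose 1's are `spacing` bits apart.

    If u fits in `spacing` bits, the product holds `count` disjoint copies of u.
    """
    v = sum(1 << (spacing * i) for i in range(count))
    return u * v


def slots(w, spacing, count):
    """Split w into `count` fields of `spacing` bits, rightmost field first."""
    mask = (1 << spacing) - 1
    return [(w >> (spacing * i)) & mask for i in range(count)]


w = copies_by_multiplication(0b1011, 8, 3)
print(bin(w))           # three copies of 1011, one per 8-bit field
print(slots(w, 8, 3))   # [11, 11, 11]
```

A single multiplication therefore produces as many copies as there are 1's in v, whereas shifting and OR-ing at most doubles the number of copies per step.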

We simulate a multitape NTM; the input initially appears on one of the tapes. A
configuration of an NTM Q comprises the contents of its tapes, the position of its tape heads,
and the state of its finite-state control. Let state(σ) denote the state of Q in configuration σ.
For two configurations σ and τ of Q, the relation σ ⊢ τ holds if Q in configuration σ may, in
one step, make a transition to configuration τ according to its transition rules. A transition
from configuration σ to configuration τ is valid if σ ⊢ τ holds. A computation of Q running
for T(n) steps is a sequence of T(n) + 1 configurations C = σ_0 σ_1 ··· σ_{T(n)}, where the ith
configuration, for all 0 ≤ i ≤ T(n), describes Q after the ith step, σ_i ⊢ σ_{i+1}, and σ_0 is the
initial configuration of Q with input ω. Computation C is accepting if state(σ_{T(n)}) is an
accepting state.

A neighborhood in a configuration comprises the contents of two adjacent tape squares,
one of which a tape head is reading, the location of the tape head on one of the squares, and
the state of the TM. Let N_{i,l}(σ) (N_{i,r}(σ)) denote the neighborhood in configuration σ in which
the tape head of the ith tape is on the left (right) square. For a one-tape NTM, we simply use
N_l(σ) and N_r(σ).

We assume that every TM worktape is one-way infinite to the left. We make this
assumption in order to simplify the relationship between a TM configuration and an
encoding of that configuration built on a PRAM[*,↑], since the PRAM[*,↑] builds numbers
that increase from left to right.

Let us outline the simulation. The simulating PRAM[*,↑], MS, generates in a single
cell a description of all possible sequences of configurations by a time bounded NTM. MS
then tests each of these configuration sequences to determine whether at least one of them is
an accepting computation of the NTM on the given input. If so, then MS accepts the input;
otherwise, MS rejects. Multiplication and shift interact to quickly generate the description of
all possible computations.

•  Oblivious  motion 

To test the validity of a possible computation, MS must be able to easily locate the tape
head in each configuration. For this reason, we have MS simulate an oblivious Turing
machine, where the head motion is regular and does not depend on the input or tape contents.
NEXPTIME can be characterized by oblivious simple NTMs working in exponential time.

Let Q be an NTM running in time T(n) with a constant number of worktapes, one
read-write head per tape, and with the input string ω initially written on one worktape. Since
Q runs for T(n) steps, it uses at most T(n) space on each tape. We construct an oblivious
Turing machine Q' that simulates Q. The simulator Q' has a single worktape T(n) + 2
squares long, with a special endmarker at each end. Q' will be a sweeping NTM. Its head
moves one square at every step and does not change direction except at the endmarkers, so
the head motion is a sequence of one-way sweeps back and forth across the worktape. The
tape head of Q' halts to accept or reject the input only at one of the endmarkers. We
partition the set of states of Q' into two sets, R and L. R denotes the set of states φ in which
the tape head moves right, unless the symbol being read is the right endmarker. L denotes
the set of states φ in which the tape head moves left, unless the symbol being read is the left
endmarker.

Lemma 7.1.1. Q' simulates Q in O(T²(n)) time.

Proof. We describe the simulation of a general step of Q by Q'. We describe the simulation
for Q with a single worktape. To generalize the simulation to the case where Q has multiple
tapes, Q' uses multiple tracks, one per tape of Q. Suppose Q' is in a right sweep; that is, Q'
is in a state φ_{Q'} ∈ R. The case where Q' is in a left sweep is handled similarly.

Suppose the simulated machine Q at this time is in state φ_Q and that Q writes a symbol
s on the tape square currently being read, nondeterministically selects a next state χ_Q, and
moves to the right. The simulating oblivious sweeping NTM Q' writes s, selects
corresponding next state χ_{Q'}, and moves to the right. This completes the simulation of this
step of Q.

Suppose instead when Q is in state φ_Q that Q writes s, nondeterministically selects next
state ψ_Q, and moves to the left. In this case, Q' writes symbol s* as a marker for this
position on the tape, selects state ψ^R_{Q'} ∈ R, and moves to the right. On successive steps, Q'
continues to sweep to the right in state ψ^R_{Q'} without writing until it reaches the right
endmarker. Q' then enters state ψ^L_{Q'} ∈ L and begins a left sweep. Q' sweeps left in state
ψ^L_{Q'} without writing until it reads symbol s*. At this point, Q' writes symbol s, enters state
ψ_{Q'} ∈ L, and moves to the left. Now Q' is ready to simulate the next move of Q. This takes
O(T(n)) steps.

If the tape head of Q remains stationary, then Q' simulates this step similarly to a left
head motion in O(T(n)) steps.

Thus, Q' simulates each step of Q in O(T(n)) steps, and the entire computation of Q in
O(T²(n)) steps. □

We construct a PRAM[*,↑], MS, that simulates Q via Q' in O(log T(n)) time.

•  Construction of a description of all configuration sequences of Q

Let d denote the number of bits needed to encode each symbol in the tape alphabet.

Let σ be a configuration of a one-tape NTM. Let η₁ be a bit string describing the contents of
the worktape in σ with d bits per symbol from the right end to the square that the tape head
is reading (but not including the contents of that square); let η₂ be a bit string describing the
contents of the worktape from the square that the tape head is reading to the left end of the tape;
let φ denote an encoding of state(σ). We encode a configuration σ by an integer μ such that

#μ = η₂φη₁.

Since Q' runs for O(T²(n)) time on T(n) + 2 space, each configuration of Q' can be
described in O(T(n)) bits, and each computation of Q' can be described in O(T³(n)) bits.

Let I_m denote the integer such that #I_m is the concatenation of all bit strings m bits long.
Specifically,

$I_m = \sum_{k=0}^{2^m - 1} k \cdot 2^{km}$.

For an integer x, where #x = b_y ··· b_1 b_0, for all 0 ≤ i ≤ ⌊y/m⌋, we call b_{(i+1)m-1} ··· b_{im} a
slot. For m = O(T³(n)), MS will generate I_m as a list of all strings that can possibly
describe a computation of Q' on input ω. We view #I_m as the concatenation of 2^m slots,
each of which represents a sequence of configurations of Q'.
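For small m, the integer I_m and its slots can be computed directly from the definition. This is a brute-force sketch for illustration only (it sums 2^m terms, which is exactly what MS must avoid); the names `concat_all_strings` and `slot` are hypothetical.

```python
def concat_all_strings(m):
    """I_m = sum over k of k * 2^(k*m): slot k holds the m-bit string #k."""
    return sum(k << (k * m) for k in range(1 << m))


def slot(x, i, m):
    """Bits b_{(i+1)m-1} ... b_{im} of x, returned as an integer."""
    return (x >> (i * m)) & ((1 << m) - 1)


I2 = concat_all_strings(2)
print(format(I2, "08b"))                  # 11100100: slots 3, 2, 1, 0
print([slot(I2, k, 2) for k in range(4)])  # [0, 1, 2, 3]
```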

We defined I_m as the sum of 2^m terms. We want MS to generate I_m in O(log m) time.
Therefore, we cannot simply activate a processor to build each term, then sum the terms,
because this process takes O(m) time.

Let $\mathit{mask}_j = \sum_{i=0}^{2^m - 1} 2^{mi + j}$; that is, #mask_j has a 1 in the jth bit position of each slot. Let
S_j = I_m ∧ mask_j; that is, #S_j and #I_m are equal in the jth bit position of each slot, and #S_j is
0 elsewhere. Recall that slot k of #I_m contains #k, so the value of the jth position in slot k of
#S_j is equal to the value of the jth bit position of #k.

MS generates I_m by building each S_j, 0 ≤ j ≤ m-1, then combining them:

$I_m = \bigvee_{j=0}^{m-1} S_j$.


Let us now describe how MS constructs the S_j's. Although we have defined S_j in terms of
I_m, we utilize an alternative definition to explain our construction of S_j. Let b = 2^m/2^{j+1}.
Let

$\mathit{river}_j = \sum_{i=0}^{b-1} 2^{(2i+1)m2^j + j} = \sum_{i=0}^{b-1} 2^{i m 2^{j+1} + m 2^j + j}$ and $\mathit{bayou}_j = \sum_{k=0}^{2^j - 1} 2^{km}$.

Thus, S_j = river_j · bayou_j. If we look only at the jth bit position of each slot of #I_m (or equivalently,
#S_j), we find alternating sequences of 2^j 0's and 2^j 1's. Our second definition of S_j reflects
this: we place a 1 in the jth position of each (2^{j+1})th slot (river_j), then multiply by an
integer whose two's complement representation has 2^j 1's, appropriately spaced (bayou_j).


To generate each river_j, MS combines a set of values called stream_k. We now define
stream_k and tell how MS generates the stream_k's and how MS uses them to build the river_j's.
For all 0 ≤ k ≤ m-1, stream_k = 2^{m2^k} + 1 = [1 ↑ (m2^k)] + 1; that is, #stream_k has a 1 in the
rightmost position of slot 2^k and a 1 in the rightmost position of slot 0. MS activates m
processors, P_m, ..., P_{2m-1}, in O(log m) steps. Processor P_{m+k}, 1 ≤ k ≤ m-1, computes
stream_k in constant time.

Next, we generate 'river_j, a value one step from river_j. Define

${}'\mathit{river}_j = \sum_{i=0}^{b-1} 2^{i m 2^{j+1}}$;

that is, 'river_j is river_j shifted right so that the least significant 1 is in the 0th bit position:
'river_j = river_j ↓ (m2^j + j). We build 'river_j, 0 ≤ j ≤ m-2, as the product of m - j - 1
stream_k's:

${}'\mathit{river}_j = \prod_{k=j+1}^{m-1} \mathit{stream}_k$.

We want to compute all 'river_j's in O(log m) time. This
is a parallel postfix computation, which MS can perform in time O(log m) with m processors
(Ladner and Fischer, 1980).

We now have 'river_j for 0 ≤ j ≤ m-2. The remaining item, 'river_{m-1}, is simply 1.
Now for all 0 ≤ j ≤ m-1, we obtain river_j from 'river_j in constant time:
river_j ← 'river_j ↑ (m2^j + j).

MS next computes the bayou_j's. Each bayou_j is the product of j easily computed
terms:

$\mathit{bayou}_j = \sum_{k=0}^{2^j - 1} 2^{km} = \prod_{i=1}^{j} \left(2^{(m/2)2^i} + 1\right)$.

This is a parallel prefix computation. Once again, MS computes all bayou_j's simultaneously
in O(log m) time (Ladner and Fischer, 1980).

For all j, 0 ≤ j ≤ m-1, processor P_{m+j} computes S_j = river_j · bayou_j. MS now computes

$I_m = \bigvee_{j=0}^{m-1} S_j$ in O(log m) time.

Since m = O(T³(n)), MS computes I_{T³(n)} in O(log T(n)) time.
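The stream, river, and bayou construction can be checked against the direct definition of I_m for small m. The sketch below is sequential and only verifies the arithmetic; it does not model the O(log m) parallel prefix schedule. All function names are hypothetical, and the products are taken over stream_k exactly as in the formulas above (the factors of bayou_j are stream_0, ..., stream_{j-1}).

```python
def stream(k, m):
    """stream_k = 2^(m * 2^k) + 1."""
    return (1 << (m * 2**k)) + 1


def river(j, m):
    """river_j: product of stream_{j+1..m-1}, shifted left by m*2^j + j."""
    shifted_river = 1                      # this is 'river_j
    for k in range(j + 1, m):
        shifted_river *= stream(k, m)
    return shifted_river << (m * 2**j + j)


def bayou(j, m):
    """bayou_j: product of stream_0 .. stream_{j-1} (2^j ones, m bits apart)."""
    product = 1
    for i in range(j):
        product *= stream(i, m)
    return product


def build_I(m):
    """OR together S_j = river_j * bayou_j for 0 <= j <= m-1."""
    result = 0
    for j in range(m):
        result |= river(j, m) * bayou(j, m)
    return result


def direct_I(m):
    """I_m straight from its definition, for comparison."""
    return sum(k << (k * m) for k in range(1 << m))


print(all(build_I(m) == direct_I(m) for m in range(1, 5)))
```

For m = 2, for instance, S_0 = 01000100 and S_1 = 10100000 in binary, and their OR is 11100100, which lists the strings 11, 10, 01, 00 slot by slot.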

•  Testing  for  a  valid  computation 

We view I_{T³(n)} as a list of all possible configuration sequences. Each sequence is
O(T²(n)) configurations long, and each configuration is O(T(n)) bits long. Thus, I_{T³(n)} is a
list of all possible descriptions of a computation of the oblivious sweeping NTM Q' on the
input. MS must now test whether at least one of the configuration sequences in #I_{T³(n)}
represents an accepting computation of Q'. MS will first build a set of bit masks to be used
in the testing, then for each configuration sequence, MS will evaluate the following:

Test 1: whether the transition from each configuration to the next is valid,

Test 2: whether the first configuration corresponds to the initial configuration of
Q', and

Test 3: whether Q' is in an accepting state in the last configuration.

If all tests are true for a configuration sequence, then MS accepts; if for all configuration
sequences, at least one test fails, then MS rejects.

Recall that Q' runs for O(T²(n)) steps on T(n) + 2 space, so a computation comprises
p = O(T²(n)) configurations, where each configuration is f = O(T(n)) bits long. The tape
head of Q' sweeps back and forth between the endmarkers: the tape head makes T(n) + 1
moves from one endmarker to the next, then reverses direction. Hence, for
0 ≤ i < p/(2T(n)+2) and 0 ≤ j ≤ T(n), in configurations numbered i(2T(n)+2) + j, the head
is located one square to the left in the following configuration, and in configurations
numbered i(2T(n)+2) + T(n) + 1 + j, the head is located one square to the right in the
following configuration.
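The oblivious head motion described above depends only on the step number. The following minimal sketch makes that explicit, assuming squares are indexed 0 (left endmarker) through T + 1 (right endmarker) and that the head starts on the right endmarker in configuration 0, which matches the numbering in which configurations i(2T+2) + j precede a left move. The function names are hypothetical.

```python
def head_position(step, T):
    """Square under the head of the sweeping machine in configuration `step`."""
    r = step % (2 * T + 2)          # position within one full back-and-forth sweep
    return T + 1 - r if r <= T + 1 else r - (T + 1)


def next_move(step, T):
    """Direction of the move taken from configuration `step`."""
    r = step % (2 * T + 2)
    return "left" if r <= T else "right"


T = 2
print([head_position(s, T) for s in range(7)])   # [3, 2, 1, 0, 1, 2, 3]
print([next_move(s, T) for s in range(6)])
```

Because neither function consults the tape contents, a simulator can locate the head in any configuration of a candidate sequence from the configuration number alone.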

We interpret each slot of #I_m as a representation of a configuration sequence of Q'.
Further, we view each slot of #I_m as the concatenation of p notches, where each notch
represents a configuration of Q'. Let config(j) denote the contents of notch j interpreted as a
configuration. Let us call the notches that hold configurations from which the tape head is to
move right (left) right notches (left notches). For 0 ≤ i < p/(2T(n)+2) and 0 ≤ k ≤ T(n), left
notches are numbered i(2T(n)+2) + k; right notches are numbered
i(2T(n)+2) + T(n) + 1 + k. (Right and left notches occur in alternating sequences of
T(n) + 1 notches.) A right (left) notch describes one configuration in a configuration
sequence during a right (left) sweep of Q'.

Recall that N_l(σ) (N_r(σ)) denotes the neighborhood in configuration σ in which the tape
head is on the left (right) square. Given the contents of two adjacent notches that
represent configurations σ_i and σ_{i+1}, where the tape head of Q' is in square q in
configuration σ_i, MS must check the following to determine whether σ_i ⊢ σ_{i+1}: (1) MS
must compare N_l(σ_i) and N_r(σ_{i+1}) (if state(σ_i) ∈ R) or N_r(σ_i) and N_l(σ_{i+1}) (if state(σ_i) ∈
L) to determine whether the neighborhoods around the tape head in σ_i and σ_{i+1} represent a
valid transition of Q' and (2) MS must check that the remainder of the configuration is
unchanged. To perform Test 1, MS builds four values to be used as bit masks: Lmask_0,
Lmask_1, Rmask_0, and Rmask_1. For the number j of a left (respectively, right) notch,
#Lmask_0 (respectively, #Rmask_1) will have 1's in the bit positions corresponding to
N_r(config(j)) if j is even and N_l(config(j)) if j is odd, and the other notches will be all
0's. For the number j of a left (respectively, right) notch, #Lmask_1 (respectively, #Rmask_0)
will have 1's in the bit positions corresponding to N_l(config(j)) if j is even and
N_r(config(j)) if j is odd, and the other notches will be all 0's. Thus, Lmask_0 and Lmask_1
test transitions from the left notches (transitions in a left sweep), and Rmask_0 and Rmask_1
test transitions from the right notches (transitions in a right sweep).

In Figure 7.1, the squares represent a set of four adjacent left notches, the arrows
indicate the squares that the tape head should be reading in each configuration, the x's
represent 1's in #Lmask_0, and the y's represent 1's in #Lmask_1. We use Lmask_0 to test
transitions from even numbered left notches. Lmask_0 isolates a constant size neighborhood
around the tape head to test the transition, and the complement of Lmask_0 isolates the
remainder of the configuration to test that it remains unchanged.

We describe the construction of Lmask_0; MS constructs the other masks similarly. Let
us first formally define Lmask_0 over a single slot. A slot comprises p notches, each l bits
long. In left notches j and j+1, for j even, #Lmask_0 has 1's covering identical positions;
then in left notches j+2 and j+3, #Lmask_0 has 1's covering positions two squares to the
left. In notch j, the tape head is on the right of the two squares in the covered
neighborhood; in notch j+1, the tape head is on the left of the two squares in the covered
neighborhood. (That is, 1's cover the bit positions corresponding to N_r(config(j)) and
N_l(config(j+1)).)

Figure 7.1. Portions of Lmask_0 and Lmask_1

Assume without loss of generality that T(n) is even. Let d denote the number of bits
needed to specify each symbol in the tape alphabet, and let e denote the number of bits
needed to specify a neighborhood.

We will build Lmask_0 from mask_c (defined below):

Lmask_0 = mask_c ∨ (mask_c ↑ l)

since #Lmask_0 has 1's in identical positions in adjacent pairs of left notches. We build our
definition of mask_c from a definition over a set of left notches, then a definition over a
single slot, then a full-size definition. Let us define the following masks. Recall that
m = O(T^2(n)).

maska  =T(^2kW^  +  2kW+2d^]  +  ■■  +2  kW+2d)  +  e-\ 

maskf,  =P  ('^r'^+2\2Jd^T<-n'^2)-maska)  =  <?  (2^ ^2)2jd{2T(n )+!■'>)■  maska 

\[ \mathit{mask}_c = \sum_{i=0}^{2^m-1} \bigl(2^{ip} \cdot \mathit{mask}_b\bigr) = \Bigl(\sum_{i=0}^{2^m-1} 2^{ip}\Bigr) \cdot \mathit{mask}_b \]

Note that mask_a is the portion of mask_c over only a single set of left notches, and mask_b
is the portion of mask_c over only a single slot.

Now let us describe how MS constructs mask_c. First, MS constructs mask_a in
O(log T(n)) time, using fT(n)/2 processors. MS next computes the sum of the terms
2^{jd(2T(n)+2)} over j. Observe that p/(2T(n)+2) = O(T(n)), so MS also computes this
quantity in O(log T(n)) time. MS now obtains mask_b with a single multiplication. Finally,
MS builds

\[ E = \sum_{i=0}^{2^m-1} 2^{ip}. \]

This is more difficult, since E is the sum of 2^m terms, and MS must build E in O(log T(n))
= O(loglog 2^m) time. We rely on an interaction between multiplication and shift, as in our
construction of I_m. We will show that E is also the product of m easily computed terms.
Define brook_k = 2^{(p/2)2^k} + 1. Then

\[ E = \prod_{k=1}^{m} \mathit{brook}_k. \]

MS computes brook_k, 1 ≤ k ≤ m, in constant time, then builds E in O(log m) time with m
processors. Now we build mask_c = E · mask_b in one step. MS next builds Lmask_0 =
mask_c ∨ (mask_c ↑ l). MS constructs Lmask_1, Rmask_0, and Rmask_1 similarly in
O(log T(n)) time.
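The identity behind the construction of E can be checked numerically. The sketch below is only an illustration, assuming p is even so that (p/2)·2^k is an integer; the function names are hypothetical, and the product is taken sequentially here, whereas MS would combine the m factors in a balanced tree in O(log m) time.

```python
def brook(k: int, p: int) -> int:
    """brook_k = 2^((p/2) * 2^k) + 1, assuming p is even."""
    return (1 << ((p // 2) * (1 << k))) + 1


def e_by_sum(m: int, p: int) -> int:
    """E written as the sum of 2^m terms: sum of 2^(i*p) for 0 <= i < 2^m."""
    return sum(1 << (i * p) for i in range(1 << m))


def e_by_product(m: int, p: int) -> int:
    """E written as the product of the m terms brook_1, ..., brook_m."""
    result = 1
    for k in range(1, m + 1):
        result *= brook(k, p)  # multiplying by (x + 1) doubles the number of 1 bits
    return result
```

Each factor brook_k doubles the number of evenly spaced 1's already built, which is why m multiplications suffice for 2^m terms.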

We now describe how we use Lmask_0 to test whether half the transitions from
configurations in left notches are valid. (We will use Lmask_1 to test the other half.) For
each j, where j is even and the number of a left notch, we test whether
config(j) ⊢ config(j+1). Simulator MS performs the tests for all such configurations
simultaneously.

Since a TM has a finite-state control and a neighborhood can be specified in a constant
number of bits, there are only a constant number of valid transitions. Let N denote the set
of all e-bit strings y such that y contains a description of two tape symbols written by Q'
and a description of a state of Q' at the position of the tape head. That is, N is the set of
strings describing neighborhoods that can actually occur in a computation of Q'. Let TR
denote the set of pairs (ζ, η) such that ζ, η ∈ N, and in one step of Q', neighborhood ζ,
with Q' in the state and with cell contents specified by ζ, becomes neighborhood η; that is,
TR is the set of
valid  transitions  between  neighborhoods.  For  each  (£,  q)  e  TR ,  MS  creates  masks  Cynask  = 

C 'H  and  r\mask  =  r|  (s  T/). 

Let j be the number of a left notch and let j be even. Let yang(j) denote the portion of
notch j where the tape head of Q' should be in a computation (the position of
N_r(config(j))); let yin(j) denote the rest of notch j. For each pair (ζ, η) ∈ TR, MS tests
yang(j) for equality with ζ, for all such j, by using ζmask. For those found equal, MS tests
yang(j+1) for equality with η using ηmask, and MS tests yin(j) and yin(j+1) for equality
using the complement of Lmask_0. For those found equal, MS tags notch j+1 with a 1. With a
constant number of such tests, MS checks the transitions from even numbered left notches j.

Similarly, MS tests the remainder of the transitions from all notches in all slots of I_m
using Lmask_1, Rmask_0, and Rmask_1. This completes Test 1.

MS now performs Test 2 and Test 3. Recall that Test 2 is a test of whether the first
notch in each slot contains the initial configuration of Q', and Test 3 is a test of whether
the last notch in each slot contains an accepting configuration of Q'. In O(log T(n)) time,
MS builds masks for each of these tests in the same manner as its other masks, then performs
Tests 2 and 3 in constant time, tagging those notches that pass the test with a 1 and those
notches that fail with a 0.

For each slot, MS now ANDs all its tags together (the tags written during the tests). All
the tags for all the slots are concatenated in a single cell c(g). If any notch in a slot
fails one of the tests, then the AND of tags in that slot is 0. If every notch in a slot
passes every test, then the AND of tags in that slot is 1. If con(g) = 0, then some notch in
every slot has failed a test, and no slot holds a valid computation; therefore, MS rejects
the input. If con(g) ≠ 0, then some slot holds a representation of a valid computation;
therefore, MS accepts the input.

Theorem 7.1. For all T(n) ≥ log n, NTIME(T(n)) ⊆ PRAM[*,↑]-TIME(log T(n)).

Proof. By the simulation above, a PRAM[*,↑], MS, simulates an NTM, Q, via an oblivious
sweeping NTM, Q'. Q' simulates Q in time O(T^2(n)), and MS simulates Q', hence Q, in
time O(log T(n)). □

Corollary 7.1.1. NEXPTIME ⊆ PRAM[*,↑]-PTIME.

7.2. Simulation of PRAM[*,↑,↓] by TM

By itself, the shift operation produces numbers that can be extremely long, but not very
complex. We took advantage of this lack of complexity by manipulating encodings of
numbers when we simulated a PRAM[↑,↓] in Chapter 6. By itself, the multiplication
operation produces numbers that are more complex, but reasonably bounded in length. We
took advantage of this bound on length by addressing individual bits of numbers generated
with multiplication when we simulated a PRAM[*] in Chapter 4. The combination of
multiplication and shift gives rise to extremely long numbers of greater complexity. To
simulate a PRAM[*,↑,↓], a TM once again uses the interesting bit encoding to deal with the
length; the increased complexity of the numbers requires the TM to use more space.

If i = j*k, where j has x_j interesting bits and k has x_k interesting bits, then i has up
to x_j x_k interesting bits. Consequently, an encoding of a number generated by a
PRAM[*,↑,↓] may have a doubly exponential number of nodes. Simon (1981a) gave only a short
sketch as a proof that RAM[*,↑]-PTIME ⊆ EXPSPACE. He gave the bare bones of a proof of a
containment in doubly exponential space by writing out entire encodings, then said that the
space bound could be reduced to singly exponential space by using pointers into an
encoding. Here we present the details of a simulation of a PRAM[*,↑,↓] running in
polynomial time by a Turing machine running in exponential space. This simulation is
similar to the direct simulation of a PRAM[↑,↓] running in polynomial time on a Turing
machine running in polynomial space (Section 6.3) in that we manipulate pointers into the
interesting bit encoding of cell contents. Since the number of interesting bits is at most
doubly exponential, the TM uses exponential space to describe a pointer into an encoding.

Let Q be a PRAM[*,↑,↓] that uses T(n) time and P(n) processors. Let Q' be a
PRAM[*,↑,↓] that uses only short addresses and simulates Q according to the Associative
Memory Lemma. Thus, Q' uses O(P^2(n)T(n)) processors, O(T(n)) time, and only
addresses in 0, 1, ..., O(P(n)T(n)).

We construct a TM M that simulates Q via Q' in O(T^2(n) 4^{T(n)} n log n) space. Without
loss of generality, assume M is given T(n).

M uses five mutually recursive procedures: PCOUNTER, COMPARE, SYMBOL,
ADD, and MULTIPLY. The first four of these are the procedures of the same names from
Section 6.3, the direct simulation of a PRAM[↑,↓] by a TM, except that if instr is
r(i) ← r(j) * r(k) in SYMBOL, then SYMBOL(i, ψ, m, t) returns
MULTIPLY(j, k, ψ, m, t-1). Every parameter can be written in the space required to write
a pointer into an encoding. (This space will be bounded in Lemma 7.2.3.)

When we add a column of partial products while performing a multiplication, the
column can include up to z 1's in it, where

\[ z = O\Bigl(\underbrace{2^{2^{\cdot^{\cdot^{\cdot^{2}}}}}}_{T(n)-1}\Bigr) \]

is the operand length. To express the sum of the column, we would have to represent every
integer from 0 to z, and we cannot represent all such numbers in exponential space,
regardless of the representation. We use the Booth multiplier encoding algorithm (Hwang,
1979) to overcome this obstacle.

Let d be an integer and let w = len(d). Let b_{w-1} ··· b_0 be the w-bit two's complement
representation of d. Define b_{-1} = 0. A plus Booth bit of d is a bit b_i such that b_i = 0
and b_{i-1} = 1; a minus Booth bit of d is a bit b_i such that b_i = 1 and b_{i-1} = 0. A
Booth bit of d is a bit that is either a plus Booth bit or a minus Booth bit.

We define the Booth (B) encoding as B(d) = c_{w-1} ··· c_0, where c_i = 1 if b_i is a plus
Booth bit, c_i = 1̄ (-1) if b_i is a minus Booth bit, or c_i = 0 otherwise. The Booth encoding
replaces strings of 1's with a 1, 0's, and a 1̄ (-1).

For example, B(01111111) = 10000001̄. Suppose we want to multiply a multiplicand
m by multiplier 01111111. By the naive algorithm, we add seven partial products, each of
which is m shifted by some value; by the Booth algorithm, we add only m ↑ 7 and -m.
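The Booth encoding just defined can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, digits are listed most significant first, and the value -1 stands for the digit 1̄. Per the definition, c_i = b_{i-1} - b_i with b_{-1} = 0.

```python
def booth_digits(d: int, w: int) -> list:
    """Booth digits c_{w-1} ... c_0 (most significant first) of the w-bit
    two's complement representation of d; -1 stands for the digit written 1-bar."""
    digits = []
    prev = 0  # b_{-1} = 0 by definition
    for i in range(w):
        b = (d >> i) & 1          # bit b_i of the two's complement representation
        digits.append(prev - b)   # +1: plus Booth bit, -1: minus Booth bit, 0: neither
        prev = b
    return digits[::-1]


def booth_multiply(m: int, d: int, w: int) -> int:
    """Multiply m by d using one shifted partial product per Booth bit of d."""
    total = 0
    for i, c in enumerate(reversed(booth_digits(d, w))):
        if c == 1:
            total += m << i   # plus Booth bit: add m shifted left by i
        elif c == -1:
            total -= m << i   # minus Booth bit: subtract m shifted left by i
    return total
```

For the multiplier 01111111 the loop in `booth_multiply` touches only two nonzero digits, matching the two partial products m ↑ 7 and -m in the example above.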

We  define  the  Booth-interesting  (BI)  encoding,  by 
BI(0)  =  0, 

Bid)  =  1, 

BI(d) = (I(a_s), ..., I(a_1); r),

where a_j is the position of the jth Booth bit of d and r is the value (0 or 1) of the
rightmost digit of B(d).

The Booth-interesting encoding and the interesting bit encoding are closely related: an
interesting bit at position j of #d corresponds to a 1 or 1̄ in B(d) at position j+1. The
rightmost nonzero in B(d) is a 1̄, with 1 and 1̄ alternating afterwards. If #d begins with a 1
in the rightmost position, then B(d) has a 1̄ in the rightmost position; otherwise, B(d) has
a 0 in the rightmost position. Because of this relationship, M readily converts from I(d) to
BI(d) with procedure ADD.

To perform a multiplication, we will convert the multiplier from the interesting bit
encoding to the Booth-interesting encoding, then multiply it with the multiplicand. By
Booth's algorithm, we have only as many partial products as we have Booth bits. Since each
integer in a computation has at most O(n 2^{T(n)}) interesting bits (Lemma 7.2.3), there will
be O(n 2^{T(n)}) partial products.

We now describe MULTIPLY(j, k, ψ, m, t), where j and k are register addresses, ψ is
a pointer, m is a processor number, and t is a time step. MULTIPLY(j, k, ψ, m, t) returns
the symbol to which ψ points in I(val(rcon_m(j)) * val(rcon_m(k))) if ψ points to a leaf in
the product; otherwise, MULTIPLY returns a signal that ψ points to a subtree of the product.
Assume that we are manipulating pointers into the Booth-interesting encoding of the
multiplier, val(rcon_m(k)). Let γ denote I(val(rcon_m(j)) * val(rcon_m(k))).

Using the Booth encoding of the multiplier, we have O(n 2^{T(n)}) partial products. Each
partial product p is a copy of the multiplicand shifted by the value of the position of a
Booth bit, b, in the multiplier. If b is a plus Booth bit, then p is added; if b is a minus
Booth bit, then p is subtracted. To efficiently find the symbol γ.ψ, we perform carry-save
addition of the partial products. This simplifies our computations, since if t ← u ○ v,
where ○ is a Boolean operation, then each first-level subtree of I(t) is a first-level
subtree of I(u) or I(v).

To perform carry-save addition on three numbers, t, u, and v, we generate a sum term
S and a carry term C by Boolean operations on t, u, and v. Then we add S and C to get
t + u + v. Specifically, S = t ⊕ u ⊕ v and C = (majority(t, u, v)) ↑ 1, where
majority(t, u, v) returns 0 (1) in position y if the majority of bits in position y of #t,
#u, and #v are 0 (1).
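The carry-save step described above can be written directly with bitwise operations. The following is a minimal sketch with a hypothetical function name; Python integers behave as unbounded two's complement values, so the identity S + C = t + u + v can be checked directly.

```python
def carry_save(t: int, u: int, v: int):
    """Return (S, C) with S + C == t + u + v, using only Boolean operations
    and a left shift by one position."""
    s = t ^ u ^ v                               # bitwise sum without carries
    majority = (t & u) | (t & v) | (u & v)      # 1 where at least two inputs have a 1
    c = majority << 1                           # each carry moves one position left
    return s, c
```

Only the final addition S + C propagates carries; the reduction from three operands to two does not.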

We wish to compute the sum of g = n 2^{T(n)} operands, OP_1, OP_2, ..., OP_g. We use a
divide-and-conquer method, repeatedly splitting each sum of y operands into two sums of
y/2 operands, until y = 2. When we reach the base case, we declare one operand to be S and
the other to be C, then return one level up in the recursion. At this level, two partial sums
return their sum and carry terms: S_1 and C_1 and S_2 and C_2. We produce sum and carry
terms, S_3 and C_3, for S_1 + C_1 + S_2, then sum and carry terms, S_4 and C_4, for
S_3 + C_3 + C_2. Then S_4 and C_4 are returned up to the next level of recursion. At the end
of the recursion, we have a sum term S_g and a carry term C_g, both of which were generated
solely by Boolean operations, and we add these together.
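The recursion above can be illustrated on explicit integers. This sketch is an assumption-laden illustration, not the procedure M runs: it writes out every operand and every sum and carry term, which M cannot afford to do, and the function names are hypothetical. It shows only that the two carry-save steps per level preserve the total.

```python
def carry_save(t: int, u: int, v: int):
    """One carry-save step: (S, C) with S + C == t + u + v."""
    return t ^ u ^ v, ((t & u) | (t & v) | (u & v)) << 1


def reduce_operands(ops):
    """Return (S, C) with S + C == sum(ops), for a non-empty list of integers."""
    if len(ops) == 1:
        return ops[0], 0              # a lone operand needs no carry term
    if len(ops) == 2:
        return ops[0], ops[1]         # base case: declare one operand S, the other C
    mid = len(ops) // 2
    s1, c1 = reduce_operands(ops[:mid])
    s2, c2 = reduce_operands(ops[mid:])
    s3, c3 = carry_save(s1, c1, s2)   # terms for S_1 + C_1 + S_2
    s4, c4 = carry_save(s3, c3, c2)   # terms for S_3 + C_3 + C_2
    return s4, c4


def sum_by_carry_save(ops):
    """Single carry-propagating addition at the very end."""
    s, c = reduce_operands(ops)
    return s + c
```

For example, `sum_by_carry_save` applied to signed partial products (as produced by plus and minus Booth bits) returns their ordinary sum.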

This description overlooks one important factor: just as we do not have enough space
to write an integer generated by a PRAM[*,↑,↓] or even its encoding, we do not have
enough space to write a sum or carry term. The key here is keeping track of subtrees through
the recursion. Instead of viewing this problem from the bottom as combining sum and carry
terms until we obtain S_g and C_g, let us view the problem from the top. M wants to obtain
the symbol indicated by the pointer into the encoding of the sum of the partial products γ
and is using carry-save; consequently, val(γ) = S_g + C_g. By our rules of addition
(Appendix E), M needs symbols in S_g and in C_g. Suppose M wants a particular symbol in S_g.
By the procedure described above, S_g is the XOR of three sum and carry terms; therefore, M
looks for symbols in each of the terms contributing to S_g according to our rules for XOR.
(These rules are similar to the rules for OR in Appendix D.) M continues in this manner until
it reaches a base case, which is handled by our previously described rules. The recursion
stops after O(log g) = O(n 2^{T(n)}) levels. This completes the description of MULTIPLY.

We now present lemmas, analogous to those of Section 6.1, that bound the length of a
pointer into an encoding. Lemma 7.2.1 bounds the depth of an encoding and the number of
interesting bits in a number generated by a PRAM[*,↑,↓]. Let bool be a set of Boolean
operations.

Lemma 7.2.1. If a processor P_m executes r(i) ← r(j) * r(k), then depth(rcon_m(i)) ≤ 2 +
max{depth(rcon_m(j)), depth(rcon_m(k))} and intbits(rcon_m(i)) ≤
intbits(rcon_m(j)) * (1 + intbits(rcon_m(k))).

Proof. The product rcon_m(i) is the sum of partial products. Each nonzero partial product is
rcon_m(j) shifted left by the value of the position of a 1 bit in rcon_m(k). We can add the
partial products by carry-save addition. Thus, we can perform a series of Boolean operations
on the partial products and a single addition at the end. By Parts i) and iv) of Lemma 6.1.1,
depth(rcon_m(i)) ≤ 2 + max{depth(rcon_m(j)), depth(rcon_m(k))}. Recall that we convert the
multiplier to the Booth-interesting encoding in the procedure MULTIPLY. For an integer d,
the number of Booth bits in B(d) is at most 1 + intbits(d); therefore, Parts i) and ii) of
Lemma 6.1.1 apply to the number of Booth bits in B(d). Thus, the number of Booth bits in
B(rcon_m(k)) is at most 1 + intbits(rcon_m(k)). Therefore, rcon_m(i) is the sum of at most
1 + intbits(rcon_m(k)) nonzero partial products, each with intbits(rcon_m(j)) interesting
bits. Therefore, by Part i) of Lemma 6.1.1, intbits(rcon_m(i)) ≤
intbits(rcon_m(j)) * (1 + intbits(rcon_m(k))). □

Part i) of Lemma 7.2.2 bounds the number of subtrees below first level nodes in an
encoding; Part ii) bounds the number of subtrees below fth level nodes in an encoding, f > 1.

Lemma 7.2.2. Suppose a processor P_m executes r(i) ← r(j) ○ r(k), where ○ ∈ {+, ↑, ↓, *,
bool}, I(rcon_m(i)) = (I(a_s), ..., I(a_1); w_i), I(rcon_m(j)) = (I(b_t), ..., I(b_1); w_j), and
I(rcon_m(k)) = (I(c_u), ..., I(c_1); w_k), where a_v, b_v, c_v denote the positions of the vth
interesting bits of rcon_m(i), rcon_m(j), and rcon_m(k), respectively.

i) For I(a_v) (that is, the vth subtree at level 1 of I(rcon_m(i))), if ○ is *, then

intbits(a_v) ≤ max_q {intbits(b_q)} + max_q {intbits(c_q)}.

ii) For I(β) a subtree at level f > 1, intbits(β) ≤ 1 + max_q {intbits(qth subtree of
rcon_m(j) at level f), intbits(qth subtree of rcon_m(k) at level f)}.

Proof. i) A first level subtree I(a_v) encodes the position p of an interesting bit in
rcon_m(i). The subtrees of I(a_v) encode the positions of interesting bits in p. Suppose P_m
executes r(i) ← r(j) * r(k). Recall that we convert the multiplier to the Booth-interesting
encoding in procedure MULTIPLY and that for an integer d, the position of a Booth bit in B(d)
is exactly 1 beyond the position of an interesting bit in d. The product rcon_m(i) is the sum
of partial products of rcon_m(j) and B(rcon_m(k)). Each nonzero partial product is either
plus or minus rcon_m(j) shifted left by the value of the position of a Booth bit in
B(rcon_m(k)). By Part i)b) of Lemma 6.1.2 and the relationship between Booth bits and
interesting bits, the number of Booth bits in the position of a 1 or 1̄ in a partial product
is at most max_q {intbits(b_q)} + max_q {intbits(c_q)}. By Part i)a) of Lemma 6.1.2, the
position of an interesting bit in rcon_m(i) has the same upper bound.

ii) For any instruction, we add at most 1 to the value of a subtree of level f > 1, so Part

i)a) of Lemma 6.1.2 applies. □

Lemma 7.2.3. A pointer used by M can be specified in O(T(n) 2^{T(n)} log n) space.

Proof. Let d be an integer generated by Q'. By Lemmas 6.1.1 and 7.2.1, depth(d) ≤ 2T(n).
If ω is the input to Q' and ω has length n, then intbits(ω) ≤ n. Let I(β) be I(d) or a
subtree of I(d). By Lemmas 6.1.1, 6.1.2, 7.2.1, and 7.2.2, intbits(β) ≤ n 2^{T(n)}.
Therefore, any leaf in I(d) can be specified by a pointer of length T(n) 2^{T(n)} log n. (The
tree has T(n) levels, and we need space 2^{T(n)} log n to specify the branch at each
level.) □

Theorem 7.2. For all T(n) ≥ log n, PRAM[*,↑,↓]-TIME(T(n)) ⊆
DSPACE(T^2(n) 4^{T(n)} n log n).

Proof. By the simulation given above, a TM M simulates a PRAM[*,↑,↓] Q via Q'.

By Lemma 7.2.3, for each invocation of PCOUNTER, COMPARE, SYMBOL, ADD,
and MULTIPLY, M can write all variables and parameters in space O(T(n) 2^{T(n)} log n). The
depth of recursion of the first four of these procedures is at most O(T(n)). The depth of
recursion of MULTIPLY is at most O(n 2^{T(n)}) for each invocation. Thus, M can simulate Q
via Q' in space O(T^2(n) 4^{T(n)} n log n). □

Corollary 7.2.1. PRAM[*,↑,↓]-PTIME ⊆ EXPSPACE.

Chapter 8. Probabilistic Choice

In this chapter, for various instruction sets op, we present simulations of probabilistic
PRAM[op]s by deterministic PRAM[op]s. We also relate probabilistic unbounded fan-in
circuits and CRCW prob-PRAMs.

8.1.  Background 

Much attention and study have been devoted to probabilistic, or randomized, algorithms
in the past several years. Survey papers by Rabin (1976), Welsh (1983), and Rajasekaran
and Reif (1987) present a sampling of the work done on probabilistic algorithms. Naturally,
the probabilistic Turing machine (PTM) is the foundation for many probabilistic algorithms.
Gill (1977) defined the PTM as tossing a coin to decide state transitions. He also defined
different restrictions on language recognition: 1-sided, bounded 2-sided, and 2-sided error.
PSPACE on PTMs is equivalent to PSPACE on deterministic Turing machines (Simon,
1981b).

Reif (1984) presented simulations between prob-RAM[*,÷]s and prob-PRAM[*,÷]s.

We define a configuration of a RAM to comprise the contents of each of the registers used in
memory and the contents of the program counter of the processor.

Reif defined his model as follows. Let c be a constant. From any configuration C_i, a
probabilistic machine Q may enter any configuration from the set NEXT_i in one step, where
NEXT_i contains no more than c elements. Q chooses each element of NEXT_i with equal
probability, independently of previous and succeeding choices. Q accepts an input string ω
of length n in time T(n) if the probability that a computation of Q on ω reaches an ACCEPT
instruction within T(n) steps is strictly greater than 1/2. We specify the expression of the
choice of next configuration as follows. Each instruction is distinctly numbered and has the
form:

r(i) ← r(j) ○ r(k); p_1, p_2, ..., p_l

in which p_1, p_2, ..., p_l are integers denoting instruction labels, and l ≤ c; the processor
executes r(i) ← r(j) ○ r(k), then uniformly selects one of {p_1, p_2, ..., p_l} as the next
instruction. Thus a machine in a configuration such that it is currently executing the
instruction above has a choice of l possible next configurations. We allow repetition of
choices for next instruction (to weight the probability of selection).

Three theorems from Reif (1984) relating prob-RAM[*,÷]s and prob-PRAM[*,÷]s
follow.

Theorem 8.1. (Reif, 1984) Let Q be a prob-RAM[*,÷] with constructible time bound
T(n) ≥ n, memory bound S(n), and integer bound I(n), where S(n) bounds the number of
registers used by Q and I(n) bounds the value of the largest integer. Then there is a
prob-PRAM[*,÷] PZ that simulates Q. If Q is unit-cost, then PZ has unit-cost time bound
O(S(n) log I(n) + log T(n)) and processor bound O(I(n)^{S(n)} T(n)); if Q is log-cost, then
PZ has log-cost time bound O((S(n) + log T(n))^2) and processor bound O(4^{S(n)} T(n)).

Proof (sketch). PZ activates one processor for each pair (x, t), where x is a configuration
of Q and t is a time step. Let NEXT_x denote the set of possible configurations reachable by
Q from x in one step. Each processor with pair (x, t) randomly chooses some x' ∈ NEXT_x and
writes (x', t + 1). This gives the equivalent of a one-step transition matrix. PZ then
computes the transitive closure of that matrix. □

Theorem 8.2. (Reif, 1984) Let PZ be a prob-PRAM[*,÷] with constructible time bound
T(n), memory bound S(n), and processor bound P(n). Then there is a prob-RAM[*,÷] Q
with memory bound O(S(n) + P(n)) simulating PZ. If PZ is unit-cost, then Q has unit-cost
time bound O(T(n)P(n)); if PZ is log-cost, then Q has log-cost time bound
O(T(n)P(n) log P(n)).

Proof (sketch). Q performs a brute-force simulation, simulating one processor at a time. □

Theorem 8.3. (Reif, 1984) Let Q be a prob-RAM[*,÷] with constructible unit-cost time
bound T(n) ≥ n and integer bound I(n). There is a prob-PRAM[*,÷] PZ that simulates Q
and has unit-cost time bound O((T(n) log T(n) log(T(n)I(n)))^{1/2}).

Proof (sketch). PZ partitions the T(n) time steps into consecutive intervals of length L. The
interval length is approximately T^{1/2}(n). PZ then computes a look-up table that, for each
configuration x of Q, contains a configuration x' reachable by Q from configuration x in L
steps. □

8.2. Choice Sequence Simulation

In this section, we simulate a prob-PRAM[op] by a deterministic PRAM[op]. The
deterministic simulator evaluates all possible sequences of random choices in a computation
of the prob-PRAM[op].

Let Γ_t be the set of processors of prob-PRAM PZ that make a random choice at time t.
We call the choice made by the ith lowest numbered processor in Γ_t the ith random choice
made by PZ at time t. Suppose that the processors of PZ have made a total of j random
choices before time t, and suppose that k = i + j. Then we call the choice made by the ith
lowest numbered processor in Γ_t the kth random choice made by PZ. A computation of
prob-PRAM PZ has choice sequence a_0, a_1, ..., a_{r-1} if in the jth random choice, for all
0 ≤ j ≤ r-1, a processor chooses the a_jth alternative.

In this section, we simulate a prob-PRAM[op] or prob-RAM[op] by a deterministic
PRAM[op] D. Machine D will generate all possible choice sequences of the simulated
machine, then determine the number of choice sequences that lead to acceptance.

For an integer x, let #x denote its two's complement representation. If #x =
b_{r-1} ··· b_1 b_0, then let <x> denote the sequence b_0, b_1, ..., b_{r-1}.

• Sequential case

For clarity of exposition, we present Theorem 8.4, the simulation of a sequential
prob-RAM[op] by a deterministic PRAM[op]. The results stated in Theorem 8.4 are a corollary
of Theorems 8.5 and 8.6 with P(n) = 1.

Theorem 8.4. Let op ∈ {∅, {*}, {*,÷}, {↑,↓}, {*,↑,↓}}. Let Q be a prob-RAM[op] with
time bound T(n) that makes R(n) random choices. There is a deterministic PRAM[op] D
that simulates Q in O(T(n)) time with 2^{R(n)} processors.

Proof.  Suppose  that  each  random  choice  made  by  Q  is  made  between  two  alternatives.  (At 
the  end  of  the  proof,  we  specify  changes  to  the  proof  for  a  Q  that  is  allowed  more  than  two 
choices.)  We  construct  a  deterministic  PRAM[op]  D  that  simulates  Q.  Simulator  D 

activates  2R  (n)  processors  in  O  {R  (n))  time.  These  processors  are  numbered  2R<'n) . 

2-2R{n)  -  1.  Each  processor  number  encodes  a  unique  choice  sequence  of  R  (n)  elements. 

Pm  computes  om  =  m  -  2R(n\  Pm  will  simulate  a  computation  of  Q  with  choice  sequence 
<Gm>- 

Pm  sets  mask  =  1.  Pm  will  use  mask  to  read  bits  of  #Gm. 

In simulating a general step of Q in which Q executes instruction instr, P_m does the
following. If Q makes no random choices in instr, then P_m simply executes instr. If Q
makes a random choice in instr, then P_m executes the portion of instr before making the
random choice, then uses mask to read the bit of σ_m indicating the outcome of the random
choice. Next, P_m updates mask to prepare for the next random choice by adding mask to
itself. Since #mask had a single 1 in the jth bit position, after the addition #mask has a single
1 in the (j+1)st bit position, enabling P_m to read the (j+1)st bit of σ_m when the next random
choice arises.

In this manner, P_m simulates each step of Q with choice sequence <σ_m> in constant
time.

After simulating T(n) steps of Q, P_m decides whether Q halts on an ACCEPT
instruction. D now computes the number of computations that have accepted and the
number that have not accepted, then uses this information to decide, based on the acceptance
conditions, whether to accept or reject the input. This takes O(R(n)) time. The entire
computation takes O(R(n) + T(n)) steps. Since R(n) ≤ T(n), D simulates Q in O(T(n))
time with 2^{R(n)} processors.
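
The two-alternative simulation above can be sketched sequentially. This is a minimal illustration under stated assumptions, not the construction itself: the names `simulate_all` and `toy_q` are hypothetical, the loop over m stands in for the 2^{R(n)} parallel processors, and the toy machine is invented purely to exercise the mask mechanism.

```python
def simulate_all(run_q, r):
    """Run Q once per choice sequence; return (accepted, not accepted) counts.

    The loop over m stands in for processors P_m, m = 2^r .. 2*2^r - 1.
    """
    accepted = rejected = 0
    for m in range(2 ** r, 2 * 2 ** r):
        sigma = m - 2 ** r          # sigma_m encodes the choice sequence
        mask = 1                    # single 1 bit selecting the next choice

        def next_choice():
            nonlocal mask
            bit = 1 if (sigma & mask) != 0 else 0
            mask = mask + mask      # adding mask to itself moves the 1 bit up
            return bit

        if run_q(next_choice):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


def toy_q(next_choice):
    """Hypothetical machine Q: makes 3 random choices, accepts if >= 2 are 1."""
    return sum(next_choice() for _ in range(3)) >= 2
```

Calling `simulate_all(toy_q, 3)` enumerates all eight choice sequences; D would then apply its acceptance condition to the two counts.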

Now suppose Q can randomly select from up to c labels for the next instruction. Let d
be the least common multiple of 1, 2, ..., c. Alter the instructions with random choices so
that each has a choice of d labels. Let e = ⌈log d⌉. For an integer x, where #x =
b_{r-1} ··· b_1 b_0, let <x>' denote the sequence a_0, a_1, ..., a_{r/e-1} such that a_0 = b_{e-1} ··· b_0,
a_1 = b_{2e-1} ··· b_e, ..., a_{r/e-1} = b_{r-1} ··· b_{r-e}. We say that a computation of Q has choice
sequence <x>' if, for all 0 ≤ j ≤ r/e - 1, in the jth random choice made by Q from among d
alternatives, Q chooses the a_j th alternative.

Let f be the least power of 2 greater than or equal to d. The deterministic PRAM[op] D
activates f^{R(n)} processors in O(R(n)) time. The processors are numbered f^{R(n)}, ...,
2·f^{R(n)} - 1. Each processor number encodes a unique choice sequence of R(n) elements.
P_m computes σ_m = m - f^{R(n)}. P_m will simulate a computation of Q with choice sequence
<σ_m>'. In the course of simulating the computation as described above, if P_m reads an
element a_j of <σ_m>' such that a_j ≥ d, then the choice sequence is invalid. P_m then does not
report its computation as either accepting or rejecting. Of the f^{R(n)} sequences, d^{R(n)} are
valid, exactly the number of choice sequences of Q with d choices per random instruction. □
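
The padding argument can be checked numerically on small parameters. The sketch below is illustrative only: `decode` and `count_valid` are hypothetical names, alternatives are assumed to be numbered 0, ..., d-1 (so an element is invalid when it is at least d), and the brute-force loop replaces the parallel processors.

```python
from math import lcm


def decode(x, e, count):
    """Split x into `count` groups of e bits, least significant group first."""
    return [(x >> (j * e)) & ((1 << e) - 1) for j in range(count)]


def count_valid(c, r):
    """Return (d, f, number of valid choice sequences of r elements)."""
    d = lcm(*range(1, c + 1))       # every random instruction padded to d labels
    e = (d - 1).bit_length()        # e = ceil(log2 d) without floating point
    f = 2 ** e                      # least power of 2 >= d
    valid = sum(
        all(a < d for a in decode(sigma, e, r))
        for sigma in range(f ** r)
    )
    return d, f, valid
```

For c = 3 and two random choices, d = 6 and f = 8, so 64 processors are activated and 36 of them carry valid sequences.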

•  Parallel  case 

We extend the simulation of a sequential machine to a simulation of a parallel machine.

Theorem 8.5. Let op ∈ {{*}, {*,÷}, {↑,↓}, {*,↑,↓}}. Let PZ be a prob-PRAM[op] with
time bound T(n) and processor bound P(n) that makes R(n) random choices. There is a
deterministic PRAM[op] D that simulates PZ in time O(R(n) + T(n) log P(n)) with
P(n)·2^{R(n)} processors.

Proof. Assume that each random choice made by PZ is made between two alternatives. If
PZ makes random choices from among more than two alternatives, then modify the proof as
described in the proof of Theorem 8.4.

We construct a deterministic PRAM[op] D that simulates PZ. D activates 2^{R(n)}
processors in O(R(n)) time. These processors are numbered 2^{R(n)}, ..., 2·2^{R(n)} - 1. Each
processor number encodes a unique choice sequence of R(n) elements. Processor P_m
computes σ_m = m - 2^{R(n)}. P_m will simulate a computation of PZ with choice sequence
<σ_m>.

To simulate an access to shared memory cell c(k) in a computation of PZ with choice
sequence <σ_m>, D accesses its memory cell at address k·2^{R(n)} + σ_m. Given k, D can
compute k·2^{R(n)} + σ_m in constant time.

Assume without loss of generality that each processor P_m of D has two local memories:
lmem_1 and lmem_2. P_m uses lmem_1 to simulate the memory of a processor of PZ and lmem_2
to perform its own computations.
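
The interleaved addressing k·2^{R(n)} + σ_m used for shared memory gives every choice sequence a private copy of each cell. A minimal sketch, assuming a hypothetical helper name `shared_address` and using a left shift in place of the multiplication by 2^{R(n)}:

```python
def shared_address(k, sigma, r):
    """Address in D of cell c(k) for the computation with choice sequence sigma.

    Computes k * 2^r + sigma; on a PRAM[op] this is one multiplication
    (or one left shift) and one addition.
    """
    return (k << r) + sigma


# All (cell, choice sequence) pairs map to distinct addresses.
addresses = sorted(
    shared_address(k, sigma, 3) for k in range(4) for sigma in range(8)
)
```

Because σ_m < 2^{R(n)}, the copies of cell k for different choice sequences occupy 2^{R(n)} consecutive addresses and never collide with the copies of cell k+1.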

Processor P_m, 2^{R(n)} ≤ m ≤ 2·2^{R(n)} - 1, simulates PZ with choice sequence <σ_m>.
Processor P_m of D corresponds to P_0 of PZ until P_0 executes a FORK instruction. At this
time, P_m executes a FORK instruction, halting and activating P_{2m} and P_{2m+1}. After this
time, P_{2m} of D corresponds to P_0 of PZ and does not FORK any more, and P_{2m+1}
corresponds to P_1 of PZ.

In simulating a general step of PZ in which processor P_g executes instruction instr, the
corresponding processor P_j of D does the following. If P_g makes no random choices in instr,
then P_j simply executes the instruction. If P_g makes a random choice in instr, then P_j must
choose a bit of σ_m to decide the outcome of the random choice of PZ. Suppose W processors
want to make a random choice at this step. The corresponding W processors of D sort
themselves by processor number in O(log W) time (Cole, 1986). Suppose P_j is the vth
lowest numbered processor wishing to make a random choice at this step. Then P_j reads the
vth bit of #σ_m. P_j uses this bit to decide whether to select the first or second label listed in
instr. Processor P_{2m} then shifts σ_m right by W bits, leaving only the unread bits of σ_m. Note
that W ≤ P(n).
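
The rank-based assignment of choice bits in one simulated step can be sketched as follows. The function name `resolve_choices` is hypothetical, ranks are assumed to start at 0, and Python's sequential `sorted` merely stands in for the O(log W) parallel sort.

```python
def resolve_choices(requesters, sigma):
    """Give each requesting processor its own bit of sigma.

    requesters: processor numbers that make a random choice at this step.
    Returns (bit chosen by each processor, sigma with the used bits removed).
    """
    ranked = sorted(requesters)                 # stands in for the parallel sort
    bits = {p: (sigma >> v) & 1 for v, p in enumerate(ranked)}
    return bits, sigma >> len(ranked)           # drop the W bits just read
```

With three requesters the three low-order bits are consumed, and the remaining bits of the choice sequence are left in place for later steps.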

In this manner, D simulates each step of PZ with choice sequence <σ_m> in
O(log P(n)) time.

After simulating T(n) steps of PZ, P_{2m} decides whether PZ halts on an ACCEPT
instruction. D now computes the number of computations that have accepted and the
number that have not accepted, then uses this information to decide, based on the acceptance
conditions, whether to accept or reject the input. This takes O(R(n)) time. The entire
computation takes O(R(n) + T(n) log P(n)) steps. □

Next, we simulate a prob-PRAM by a deterministic PRAM with the basic instruction
set.  The  simulation  is  similar  to  the  preceding  one,  except  we  do  not  interleave  memory 
locations  allocated  to  different  choice  sequences,  and  the  basic  PRAM  must  precompute 
tables  of  addresses  and  masks  in  order  to  access  memory  cells  and  the  choice  sequence. 

Theorem 8.6. Let PZ be a prob-PRAM with time bound T(n) and processor bound P(n)
that makes R(n) random choices. There is a deterministic PRAM D that simulates PZ in
time O(R(n) + T(n) log P(n)) with P(n)·2^{R(n)} processors.

Proof. Given a prob-PRAM PZ, we construct a deterministic PRAM D that simulates PZ in
O(R(n) + T(n) log P(n)) time. The simulation follows that presented in the proof of
Theorem 8.5 with three exceptions: (1) allocation of interleaved memory locations to
different choice sequences, (2) addressing of memory cells, and (3) extraction of bits of #σ_m.

Exception 1: Basic PRAM D cannot quickly compute k·2^{R(n)} + σ_m for each memory
access to cell k. Instead, D allocates a block of cells to each choice sequence. The processor
assigned to each choice sequence initially computes the starting address of its block of cells.
PZ can only access cells with addresses up to O(n·2^{T(n)}). Thus, a block of cells of size
O(n·2^{T(n)}) is allocated to each choice sequence.

Exception 2: The addresses of the blocks assigned to each choice sequence range up to
O(n·2^{R(n)}·2^{T(n)}), since there are 2^{R(n)} blocks of size O(n·2^{T(n)}). D computes a table of
the addresses of the first cell of each block in O(T(n) + log R(n)) time. D then uses this
table to access the necessary cells in constant time.

Exception 3: D computes a set of R(n) masks prior to beginning its simulation of PZ.
For 0 ≤ i ≤ R(n) - 1, mask_i = 2^i. D computes these masks in O(R(n)) time. In the previous
simulation, processor P_{2m} shifted σ_m right by W bits after W processors read bits of σ_m.
Here, D cannot perform right shifts, so P_{2m} keeps track of the last bit read of σ_m. With this
information, a processor P_j wishing to make a random choice can select the appropriate bit
of σ_m.
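
The mask table of Exception 3 needs only addition, a Boolean AND, and a comparison. A minimal sketch, with hypothetical names `build_masks` and `read_bit`, and with the table built sequentially for clarity:

```python
def build_masks(r):
    """Return [2^0, 2^1, ..., 2^(r-1)] using only addition."""
    masks = [1]
    while len(masks) < r:
        masks.append(masks[-1] + masks[-1])   # doubling by adding to itself
    return masks


def read_bit(sigma, masks, index):
    """Read bit `index` of sigma with a Boolean AND and a comparison."""
    return 1 if (sigma & masks[index]) != 0 else 0
```

A processor that knows the index of the last bit read can then fetch its own bit by a table lookup followed by `read_bit`, with no shift instruction.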

D spends O(R(n) + T(n)) initialization time and O(log P(n)) time to simulate each
step of PZ; hence, D simulates PZ in time O(R(n) + T(n) log P(n)). □

8.3.  Markov  Chain  Simulation 

In this section, we present a simulation of a prob-PRAM[*,÷] PZ by a deterministic
PRAM[*,÷] D in time O((P(n) + log I(n))·S(n)·log T(n)) with O((k^{P(n)} I(n))^{3S(n)})
processors. We achieve this by treating the computation of PZ as a finite Markov chain in
which each configuration of PZ is a state. Depending on the relative values of T(n), S(n),
and P(n), this simulation may be more efficient for a PRAM[*,÷] than the simulations in the
previous section.

Lemma 8.7.1. (Associative Memory Lemma for prob-PRAMs) Let op ⊆ {*, ÷, ↑, ↓}. For
all T(n) and S(n), every language recognized in time T(n) using at most S(n) cells by a
prob-PRAM[op] PZ can be recognized in time O(T(n)) by a prob-PRAM[op] PZ' that uses
O(P(n)S(n)) processors and accesses only cells with addresses in 0, ..., O(S(n)).

Proof. The proof follows along the same lines as the proof of the Associative Memory
Lemma (Chapter 3), with an extension to account for probabilistic choice. Replace
P(n)T(n) with S(n), since in the proof of the Associative Memory Lemma P(n)T(n) is used
as a bound on the number of accessed cells.

When a processor P_g of PZ executes an instruction r(i) ← r(j) ○ r(k); p_1, ..., p_l, it
reads rcon_g(j) and rcon_g(k), computes v := rcon_g(j) ○ rcon_g(k), writes v in r_g(i), and
uniformly selects one of {p_1, ..., p_l} as the next instruction. The corresponding processor
P_m of PZ' simulates the first three parts of this step just as described in Chapter 3. P_m
performs the fourth part by uniformly selecting one of {p_1, ..., p_l}. The time bound
follows directly. □

Theorem 8.7. Let PZ be a prob-PRAM[*,÷] with time bound T(n), processor bound P(n),
memory bound S(n), integer bound I(n), and program length k. Then there is a
deterministic PRAM[*,÷] D that simulates PZ in time O((P(n) + log I(n))·S(n)·log T(n))
with O((k^{P(n)} I(n))^{3S(n)}) processors.

Proof. Let PZ' simulate PZ according to Lemma 8.7.1. Then PZ' has time bound O(T(n)),
processor bound O(P(n)S(n)), memory bound S(n), and integer bound I(n).

Fix an input of length n. PZ' has O((k^{P(n)} I(n))^{S(n)}) distinct configurations with
memory bound S(n), since the value of the program counter of every processor must be
considered; each can be encoded by a distinct integer no more than O((k^{P(n)} I(n))^{S(n)}).

Each instruction in the program of PZ' has up to c choices of next instruction. Let d be
the least common multiple of the numbers of choices. D will view each instruction of PZ' as
having d choices by duplicating all choices. This view preserves the probability of selecting
each of the original choices.

D activates (k^{P(n)} I(n))^{2S(n)} processors in O(S(n)(P(n) + log I(n))) time, one
processor for each pair of configurations. The processor number of each of the
(k^{P(n)} I(n))^{2S(n)} processors encodes a pair (τ, υ), where τ and υ denote configurations of
PZ'. D builds a transition probability matrix A. The processor associated with pair (τ, υ)
counts the number of ways of reaching configuration υ from configuration τ in one step.
This gives an integer in {0, 1, ..., d} which, divided by d, gives the probability of a transition
from τ to υ.

We now have the matrix dA. Each processor activates (k^{P(n)} I(n))^{S(n)} processors in
O(S(n)(P(n) + log I(n))) time to be used for squaring the matrix. D squares the matrix
⌈log T(n)⌉ times in the straightforward way. Each squaring takes
O(S(n)(P(n) + log I(n))) time. We now have the matrix d^{T(n)} A^{T(n)}. Entry (τ, υ) of A^{T(n)}
is the probability of reaching configuration υ from configuration τ in T(n) steps. D sums the
number N of ways of reaching each accepting configuration from the initial configuration. If
2N > d^{T(n)} (that is, if the probability of reaching an accepting configuration is greater than
1/2), then D accepts; otherwise, D rejects. □
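
The repeated-squaring step can be sketched on a toy chain. Everything below is illustrative: `mat_mul` and `accepts` are hypothetical names, T is assumed to be a power of two so that squaring log T times yields exactly the T-step matrix, and accepting and rejecting configurations are assumed to be absorbing.

```python
def mat_mul(a, b):
    """Straightforward product of two square integer matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def accepts(counts, d, steps_log2, start, accepting):
    """counts[t][u] = number of the d choices leading from t to u, i.e. d*A.

    Squares d*A steps_log2 times, giving d^T * A^T for T = 2**steps_log2,
    then accepts iff the acceptance probability exceeds 1/2.
    """
    m = counts
    for _ in range(steps_log2):
        m = mat_mul(m, m)
    t = 2 ** steps_log2
    n_ways = sum(m[start][u] for u in accepting)
    return 2 * n_ways > d ** t
```

Working with the integer matrix dA rather than A keeps every entry an exact count, so the final test 2N > d^{T(n)} involves no fractions.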

8.4.  Probabilistic  Circuits 

In this section, we relate probabilistic unbounded fan-in circuits and CRCW
prob-PRAMs using the relationship between their deterministic counterparts (Stockmeyer and
Vishkin, 1984). We find that, just as in the deterministic case, time and number of
processors of a prob-PRAM correspond simultaneously to depth and size of a probabilistic
unbounded fan-in circuit. Time and depth correspond to within a constant factor; number of
processors and size correspond to within a polynomial.

A probabilistic circuit PC_{n,m} is a circuit with n regular inputs and m random inputs.

Theorem 8.8. Let PZ be a prob-PRAM with time bound T(n) and processor bound P(n).
There is a probabilistic unbounded fan-in circuit PC_{n,P(n)T(n)} that simulates PZ in depth
O(T(n)) and size q(P(n), T(n), n), where q(P, T, n) is bounded above by a polynomial in
P, T, and n.


Proof. Theorem 2.1 states the deterministic result of Stockmeyer and Vishkin (1984). We
modify the proof given by Stockmeyer and Vishkin for the simulation of a deterministic
PRAM by a deterministic circuit. Assume that all probabilistic choices made by PZ are
between two alternatives. The circuit presented by Stockmeyer and Vishkin has an identical
carton of gates for each time step and each processor. Each carton is made of 13 blocks. We
add a random input bit to one of these blocks: [Update-ic]. This block selects the next
instruction to be executed by the simulated processor. If the instruction currently being
executed calls for a random choice, then the next instruction is selected based on the value of
the random bit. PC_{n,P(n)T(n)} keeps the same size and depth bounds as stated in Theorem 2.1. □

Theorem 8.9. Let PC_{n,m} be a probabilistic unbounded fan-in circuit of size S and depth T
with n regular inputs and m random inputs. There is a prob-PRAM PZ that simulates PC_{n,m} in
time O(T + log m) with O(S + n) processors.

Proof.  Theorem  2.2  states  the  deterministic  result  of  Stockmeyer  and  Vishkin  (1984).  We 
modify  the  proof  given  by  Stockmeyer  and  Vishkin  for  the  simulation  of  a  deterministic 
circuit  by  a  deterministic  PRAM.  The  only  difference  between  a  probabilistic  circuit  and  a 
deterministic  circuit  is  the  m  random  inputs  in  the  probabilistic  circuit.  PZ  activates  m 
processors  in  O  (log  m)  time.  Each  processor  simulating  one  of  the  random  inputs  randomly 
chooses  a  value,  then  writes  that  value  to  the  cell  corresponding  to  its  random  input. 

The remainder of the simulation follows as in Stockmeyer and Vishkin (1984). □

Chapter 9. Simulation by Sequential Machines

We present here simulations of PRAMs with enhanced instruction sets by RAMs with
the same instruction set through uniform, bounded fan-in circuits. We will prove that a
RAM[op] can efficiently simulate a uniform, bounded fan-in circuit and then show that the
circuits presented earlier that simulate a PRAM[op] meet the uniformity conditions.

9.1.  Definitions 

We  use  the  following  definitions  relating  to  circuits  (Ruzzo,  1981). 

• A circuit is a directed acyclic graph, where each node (gate) with indegree d > 0 is labeled
by the AND, OR, or NOT of d variables, and each node with indegree 0 is labeled by
"inp" (an input). Nodes with outdegree 0 are outputs.

• A circuit family C is a set {C_1, C_2, ...} of circuits, where C_n has n inputs and one
output. We restrict the gate numbering so that the largest gate number is (Z(n))^{O(1)}, where
Z(n) is the size of C_n. Thus the gate numbers coded in binary have length O(log Z(n)).

• The family C recognizes A ⊆ {0,1}* if for each n, C_n recognizes A^{(n)} = A ∩ {0,1}^n, that
is, the value of C_n on input inp_1, inp_2, ..., inp_n ∈ {0,1} is 1 if and only if inp_1 ··· inp_n ∈ A.
If C_n has at most Z(n) gates and depth D(n), then the size complexity of C is Z(n) and the
depth complexity is D(n). A language Q is of simultaneous size and depth complexity Z(n)
and D(n) if there is a family of circuits of size complexity Z(n) and depth complexity D(n)
that recognizes Q.

•  A  bounded  fan-in  circuit  is  a  circuit  where  the  indegree  of  all  gates  is  at  most  2.  For  each 
gate  g  in  Cn,  let  g  (X)  denote  g,  g  (L)  denote  the  left  input  to  g,  and  g  ( R )  denote  the  right 
input  to  g. 




• An unbounded fan-in circuit is a circuit where indegree is unbounded. For each gate g in
C_n, let g(X) denote g, and let g(p), p = 0, 1, 2, ..., denote the pth input to g.

• The bounded direct connection language of the family C = {C_1, C_2, ...}, L_BDC, is the
set of strings of the form <n, g, p, h>, where n, g ∈ {0,1}*, p ∈ {X, L, R}, h ∈ {inp, AND,
OR, NOT} ∪ {0,1}* such that in C_n either (i) p = X and gate g is an h-gate, h ∈ {inp, AND,
OR, NOT}, or (ii) p ≠ X and gate g(p) is numbered h, h ∈ {0,1}*.

• The unbounded direct connection language of the family C = {C_1, C_2, ...}, L_UDC, is the
set of strings of the form <n, g, p, h>, where n, g ∈ {0,1}*, p ∈ {X} ∪ {0, 1, ..., Z(n)^{O(1)}},
h ∈ {inp, AND, OR, NOT} ∪ {0,1}* such that in C_n either (i) p = X and gate g is an h-gate, h
∈ {inp, AND, OR, NOT}, or (ii) p ≠ X and gate g(p) is numbered h, h ∈ {0,1}*.

Let us now introduce two new definitions of uniformity. Let I(Z(n)) be the
concatenation of all pairs (g, h), where g, h ∈ {0, 1, ..., Z(n)^{O(1)}}.

• The family C = {C_1, C_2, ...} of bounded (respectively, unbounded) fan-in circuits of
size Z(n) is VM-uniform if there is a RAM[↑,↓] that on input I(Z(n)) returns an output
string in O(log Z(n)) time indicating for each pair (g, h) whether <n, g, L, h> is in L_BDC
and whether <n, g, R, h> is in L_BDC (respectively, indicating for each pair (g, h) the value
of p such that <n, g, p, h> is in L_UDC, for p = 0, 1, ..., Z(n)^{O(1)}, or an indication that no
<n, g, p, h> is in L_UDC). (Note: We chose the term VM-uniform because Pratt and
Stockmeyer (1976) called their restricted RAM[↑,↓] a vector machine.)

• The family C = {C_1, C_2, ...} of bounded (respectively, unbounded) fan-in circuits of
size Z(n) is MRAM-uniform if there is a RAM[*] that on input I(Z(n)) returns an output
string in O(log Z(n)) time indicating for each pair (g, h) whether <n, g, L, h> or
<n, g, R, h> is in L_BDC (respectively, indicating for each pair (g, h) the value of p such that
<n, g, p, h> is in L_UDC, for p = 0, 1, ..., Z(n)^{O(1)}, or an indication that no <n, g, p, h> is
in L_UDC). (Note: We chose the term MRAM-uniform because Hartmanis and Simon (1974)
called their RAM[*] an MRAM.)

• A gate g is at level j of C_n if the longest path from any circuit input to g has length j. Gate
g is at height j of C_n if the longest path from g to the output has length j.

• Let C_n be a bounded fan-in circuit consisting entirely of AND, OR, and inp gates with
depth D(n). We construct the circuit CT(C_n), the circuit tree of C_n, from C_n. Let gate a be
the output gate of C_n and let a be of type φ ∈ {AND, OR} with inputs from gates b and c.
Then the output gate of CT(C_n) has name (0, a), type φ, and inputs from gates named (1, b)
and (2, c). Thus, gate (0, a) is the gate at height 0 of CT(C_n) and gates (1, b) and (2, c) are
the gates at height 1 of CT(C_n). Now suppose that we have constructed all gates at height j
of CT(C_n), and we wish to construct the gates at height j+1. Each gate (i, e) at height j
corresponds to a gate e in C_n. If e is of type φ ∈ {AND, OR}, then gate (i, e) is of type φ.
Suppose gate e has inputs from gates f and g. Then the inputs to gate (i, e) of CT(C_n) at
height j+1 are the gates (2i+1, f) and (2i+2, g). If gate e is of type inp, that is, an input, and
j < D(n), then (i, e) is of type OR (if (i, e) is at an even numbered level) or type AND (if
(i, e) is at an odd numbered level), and the inputs to gate (i, e) at height j+1 are the gates
(2i+1, e) and (2i+2, e). If gate e is of type inp and j = D(n), then (i, e) is of type inp and
CT(C_n) has no gates at height j+1 connected to gate (i, e). Figure 9.1 contains an example
of a circuit tree.
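
The circuit tree construction can be sketched as a level-by-level expansion from the output gate. This is an illustration under stated assumptions: the name `circuit_tree` and the dictionary encoding of C_n are hypothetical, `depth` is assumed to equal D(n), and the level of a tree gate at height j is taken to be D(n) - j when choosing the type of a padding gate.

```python
def circuit_tree(circuit, output, depth):
    """circuit: gate -> ("AND" | "OR", left, right) or ("inp",).

    Returns tree gate (i, e) -> (type, left child name, right child name);
    the children of (i, e) are numbered 2i+1 and 2i+2.
    """
    tree = {}
    frontier = [(0, output)]
    for height in range(depth + 1):
        nxt = []
        for (i, e) in frontier:
            kind = circuit[e][0]
            if kind != "inp":
                left = (2 * i + 1, circuit[e][1])
                right = (2 * i + 2, circuit[e][2])
            elif height < depth:
                # Pad an early input down to the bottom: a gate fed twice by e.
                kind = "OR" if (depth - height) % 2 == 0 else "AND"
                left, right = (2 * i + 1, e), (2 * i + 2, e)
            else:
                tree[(i, e)] = ("inp", None, None)
                continue
            tree[(i, e)] = (kind, left, right)
            nxt += [left, right]
        frontier = nxt
    return tree


# Hypothetical example: a = AND(b, c), b = OR(x, y), with c, x, y inputs.
example = {
    "a": ("AND", "b", "c"), "b": ("OR", "x", "y"),
    "c": ("inp",), "x": ("inp",), "y": ("inp",),
}
tree = circuit_tree(example, "a", 2)
```

In the example, input c sits one level too high, so the tree replaces it by a padding gate (2, c) whose two inputs are the leaves (5, c) and (6, c).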

• In CT(C_n), define path(a, b) to be the path, if one exists, from gate a to gate b.

[Figure 9.1. An example of a circuit tree.]

• In CT(C_n), the distance from gate b to gate c is the length of the shortest path from b to c,
if such a path exists. We order all gates at distance d from gate b according to the relation
order(e, f) such that order(e, f) is true if path(e, b) intersects path(f, b) at a gate which
path(e, b) enters at the left input and path(f, b) enters at the right input. Gate e is the qth
ancestor of gate b at distance d if gate e is the qth smallest gate in the ordering of the gates at
distance d. We say that the smallest gate at distance d is the 0th ancestor. In the example
above, gate (10, j) is the 3rd ancestor of gate (0, a) at distance 3. Note that the same gate in
C_n can correspond to several ancestors of a gate at distance d in CT(C_n).

• A double-rail circuit is a bounded fan-in circuit that is given as input inp_1, inp_2, ..., inp_n
and their complements ¬inp_1, ¬inp_2, ..., ¬inp_n and that contains no NOT-gates.

Note  that  every  gate  in  a  double-rail  circuit,  except  the  input  gates,  has  exactly  two 
inputs. 

•  A  layered  circuit  is  a  double-rail  circuit  such  that  all  gates  at  level  i,  for  all  odd  i,  are 
AND-gates  and  all  gates  at  level  i,  for  all  even  i,  are  OR-gates,  and  each  input  to  a  gate  at 
level  i  is  connected  to  an  output  of  a  gate  at  level  i  -  1. 

Lemma 9.1.1. Let C = {C_1, C_2, ...} be a VM-uniform (MRAM-uniform) family of
bounded fan-in circuits of size Z(n) and depth D(n) recognizing language L. There exists a
VM-uniform (MRAM-uniform) family of bounded fan-in, double-rail circuits E =
{E_1, E_2, ...} of size O(Z(n)) and depth D(n) recognizing language L.

Proof. Fix an input size n. We construct E_n from C_n. Suppose that for each input inp_i we
have its complement, ¬inp_i. For each gate g of type y ∈ {inp, AND, OR} in C_n, circuit E_n has
a gate g of type y. E_n also has a gate g': if y = AND, then g' is of type OR; if y = OR, then
g' is of type AND; and if y = inp and gate g is input inp_i, then gate g' is of type inp and is
input ¬inp_i. Suppose h is an input to gate g in C_n. If h is of type y ∈ {inp, AND, OR}, then in
E_n, gate h is also an input to gate g and gate h' is an input to gate g'. If h is of type NOT
with input f, then in E_n, gate f' is an input to g and gate f is an input to g'.

Thus, E is a VM-uniform (MRAM-uniform) family of double-rail circuits with size
O(Z(n)) and depth D(n) recognizing language L. □
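
The double-rail construction of Lemma 9.1.1 can be sketched as follows. The encoding and the names `double_rail` and `evaluate` are hypothetical; the pair (g, False) plays the role of gate g and (g, True) the role of g'. As a small extension not taken from the proof, chains of consecutive NOT gates are followed in a loop, and the output gate is assumed not to be a NOT gate.

```python
def double_rail(circuit):
    """circuit: gate -> ("inp", i) | ("NOT", h) | ("AND" | "OR", h1, h2).

    Returns a NOT-free circuit over gates (g, False) for g and (g, True)
    for the complement g'; inputs become ("inp", i, negated).
    """
    dual = {"AND": "OR", "OR": "AND"}

    def wire(h, neg):
        # Follow NOT gates, flipping which rail is read.
        while circuit[h][0] == "NOT":
            h, neg = circuit[h][1], not neg
        return (h, neg)

    out = {}
    for g, spec in circuit.items():
        if spec[0] == "NOT":
            continue
        for neg in (False, True):
            if spec[0] == "inp":
                out[(g, neg)] = ("inp", spec[1], neg)
            else:
                kind = dual[spec[0]] if neg else spec[0]
                out[(g, neg)] = (kind, wire(spec[1], neg), wire(spec[2], neg))
    return out


def evaluate(rail, gate, bits):
    """Evaluate a double-rail gate on the input bits (a list of 0/1)."""
    spec = rail[gate]
    if spec[0] == "inp":
        return bool(bits[spec[1]]) != spec[2]   # complement rail flips the bit
    vals = [evaluate(rail, h, bits) for h in spec[1:]]
    return all(vals) if spec[0] == "AND" else any(vals)
```

Each non-NOT gate of C_n contributes exactly two gates, which matches the O(Z(n)) size bound, and no path becomes longer, which matches the depth bound D(n).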

Lemma 9.1.2. Let E = {E_1, E_2, ...} be a family of VM-uniform (MRAM-uniform),
bounded fan-in, double-rail circuits of size Z(n) and depth D(n) recognizing language L.
There exists a family of VM-uniform (MRAM-uniform), bounded fan-in, layered circuits F
= {F_1, F_2, ...} of size O(Z(n)) and depth O(D(n)) recognizing language L.

Proof. Fix an input size n. We construct F_n from E_n. Suppose j is odd and that we have
constructed j-1 levels of F_n from the first i-1 levels of E_n. We construct level j of F_n.

First assume that all inputs to gates in level i of E_n are outputs from level i-1. If all gates
in level i of E_n are AND-gates, then level j of F_n is identical to level i of E_n, and we move
on to construct level j+1 of F_n. Otherwise, for each AND-gate in level i of E_n, we place an
AND-gate in the corresponding place in level j of F_n. For each of these AND-gates h in F_n
in level j, there is an OR-gate g in level j+1 with two inputs h. For each OR-gate in level i
of E_n, we place an OR-gate in the corresponding place in level j+1 of F_n. For each input f
of these gates, we place an AND-gate in level j with two inputs f. Then we move on to
construct level j+2 of F_n. If j is even, we construct level j of F_n similarly.

Now we remove the assumption that all inputs to gates in level i of E_n are outputs from
level i-1. Construct F_n as described above with one exception. If an input to a gate g_E in
E_n at level i is the output from gate h_E at level i', i' ≠ i-1, then do not directly connect the
output from the corresponding gate h_F to an input of g_F in F_n. Instead, between gates h_F and
g_F insert alternating AND- and OR-gates, with AND-gates at odd levels and OR-gates at
even levels, and let both inputs to a gate at level k be connected to the output of the newly
inserted gate at level k-1.

Thus, F is a VM-uniform (MRAM-uniform) family of layered circuits with size
O(Z(n)) and depth O(D(n)) recognizing language L. □

9.2. Simulation of VM-Uniform Circuit by RAM[↑,↓]

In this section, we restructure a VM-uniform, bounded fan-in circuit, then simulate the
restructured circuit on a RAM[↑,↓]. The RAM[↑,↓] will operate on a circuit tree rather than
the original circuit because the RAM[↑,↓] can very easily run a circuit tree on an input, once
the input bits are properly placed. The RAM[↑,↓] will also split the circuit into slices of
depth O(log Z(n)). In this way, the RAM[↑,↓] will balance the time to generate the circuit
with the time to run the circuit.

Let C = {C_1, C_2, ...} be a VM-uniform family of bounded fan-in circuits of size
Z(n) and depth D(n) recognizing language L. For an integer x, let #x denote its two's
complement representation; the number of bits in the two's complement representation will
be clear from the context. We now describe how a RAM[↑,↓] can simulate C.

Simulation. Fix an input length n. Circuit C_n has size Z(n) and depth D(n). We
construct a RAM[↑,↓] R that recognizes L^{(n)} in time O(D(n) + log Z(n) loglog Z(n)).

By Lemmas 9.1.1 and 9.1.2, we construct a layered circuit F_n from C_n with size
O(Z(n)) and depth O(D(n)) that recognizes language L^{(n)}. Machine R simulates C_n via F_n.
For simplicity, let us say that F_n has depth D(n) and size Z(n), and that all gates of F_n are
numbered from {0, 1, ..., Z(n)-1}.

Let us first outline the simulation.

Stage 1. R generates a Z(n) × Z(n) ancestor matrix A in which each entry (g, h)
indicates whether gate h is an input of gate g in F_n.

Stage 2. R operates loglog Z(n) times on matrix A, obtaining matrix A_{log Z(n)}.

Stage 3. R extracts from A_{log Z(n)} a description of the circuit in slices of depth
O(log Z(n)) and their circuit trees.

Stage 4. R runs each slice consecutively on the input.

Stage 1: Computation of ancestor matrix. Each entry in ancestor matrix A is a bit vector
Z(n) bits long. Entry (g, h) has a 1 in bit position 0 (1) if gate h is the left (right) input to
gate g; otherwise, bit position 0 (1) holds 0. All other bit positions hold 0. Let A_1 = A. In
general, R operates on A_i to produce A_{2i}. After loglog Z(n) operations, A_{log Z(n)}(g, h) holds
1 in each bit position j if gate h is the jth ancestor of gate g at distance log Z(n), 0 otherwise.

R will first write all pairs (g, h), where g, h ∈ {0, 1, ..., Z(n)-1}, concatenated in a
single register in O(log Z(n)) time. We view a register as the concatenation of Z^2(n) slots,
each slot Z(n) bit positions long. Pair (g, h) is written in slot gZ(n) + h with the least
significant bit of #g in the 0th bit position in the slot and the least significant bit of #h in the
(Z(n)/2)th bit position in the slot. R constructs the first component of every pair one bit
position at a time, then the second component of every pair one bit position at a time.

We now describe how R builds the first component of each pair; R builds the second
component similarly. We build the first component as a bit vector with #g in each of slots
gZ(n), ..., (g+1)Z(n) - 1, for each g, 0 ≤ g ≤ Z(n) - 1. Let v denote this bit vector.

R first constructs a bit vector ξ in which each slot gZ(n), 0 ≤ g ≤ Z(n) - 1, holds #g.
Let q = log Z(n). Each integer g is q bits long. R constructs ξ in q phases, generating one bit
position at a time for all g. Let mask_i denote the bit vector where #mask_i has a 1 in the ith bit
position of each slot and 0's elsewhere. Let S_i = ξ ∧ mask_i; that is, in the ith bit position of
each slot, #S_i equals #ξ, and #S_i is 0 elsewhere. Thus, ξ = Σ_{i=0}^{q-1} S_i. In phase i, R constructs S_{q-i},
using O(q) time in phase 1 and O(1) time in every other phase.

To build $S_{q-1}$, start with

$t_1 \leftarrow 1 \uparrow [(Z(n))(Z(n)/2) + q - 1]$,

then for $j = 1, \ldots, q-1$, execute the following:

$u_j \leftarrow t_j \uparrow (Z(n) \cdot 2^{j-1})$
$t_{j+1} \leftarrow t_j \vee u_j$.

Finally, $S_{q-1} = t_q$, and R writes $S_{q-1}$ in $r_1$ and $r_2$.

At the start of phase $i$, register $r_1$ contains $S_{q-1} \vee \cdots \vee S_{q-i+1}$, and register $r_2$
contains $S_{q-i+1}$. R constructs $S_{q-i}$ from $S_{q-i+1}$ as follows:

$r_3 \leftarrow r_2 \downarrow (2^{q-i})(Z^2(n))$  (* $Z(n)$ is the slot size and slots holding $g$'s are $Z(n)$ apart *)
$r_4 \leftarrow r_2 \wedge r_3$  (* half of the 1's in $\#\mathit{rcon}(2)$ *)
$r_5 \leftarrow r_2 \oplus r_4$  (* the other half of the 1's in $\#\mathit{rcon}(2)$ *)
$r_6 \leftarrow r_4 \downarrow (2^{q-i})(Z^2(n))$
$r_2 \leftarrow r_5 \vee r_6$
$r_1 \leftarrow r_1 \vee r_2$

and $r_1$ holds $S_{q-1} \vee \cdots \vee S_{q-i+1} \vee S_{q-i}$, and register $r_2$ holds $S_{q-i}$.

Each phase after the first takes $O(1)$ time, and R builds $\xi$ in $O(q) = O(\log Z(n))$ steps.

R next builds $\upsilon$ from $\xi$ by filling in the empty slots in $\log Z(n)$ phases. Let $t_1 = \xi$. In
phase $i$, R takes the output of phase $i-1$ and computes

$u_i \leftarrow t_i \uparrow Z(n) \cdot 2^{i}$
$t_{i+1} \leftarrow t_i \vee u_i$.

We set $\upsilon = t_{\log Z(n)+1}$. In this manner, R builds the first component of each pair in
$O(\log Z(n))$ time.

R builds the second component of each pair similarly. R now has a register containing
the concatenation of all pairs $(g, h)$, $0 \le g, h \le Z(n) - 1$. For each pair $(g, h)$, R determines
simultaneously whether gate $h$ is an input to gate $g$. If gate $h$ is the left (right) input to gate
$g$, then R writes a 1 in the first (second) position in the slot. By VM-uniformity, this process
takes $O(\log Z(n))$ time. The resulting bit vector constitutes the ancestor matrix $A$.

Stage 2: Computation of distance $\log Z(n)$ ancestor matrix. Pratt and Stockmeyer (1976)
proved that given two $z \times z$ Boolean matrices $A$ and $B$, a RAM[$\uparrow,\downarrow$] can compute their
Boolean product $G$, defined by

$$G(i, j) = \bigvee_k \bigl(A(i, k) \wedge B(k, j)\bigr),$$

in $O(\log z)$ time. The RAM[$\uparrow,\downarrow$] performs the AND of all triples $i, j, k$ in one step and the
OR in $O(\log z)$ steps. Let $\theta_d$ be a function, specified below, with two bit vectors as inputs
and one bit vector as output. Given two $z \times z$ matrices $A$ and $B$ whose elements are bit
vectors $m$ bits long, let us define the function $H_d(A, B) = G$, where

$$G(i, j) = \bigvee_k \theta_d\bigl(A(i, k), B(k, j)\bigr).$$

We prove that a RAM[$\uparrow,\downarrow$] can compute the matrix $G = H_d(A, B)$ in $O(\log z + \log m)$ time.
The RAM[$\uparrow,\downarrow$] performs $\theta_d$ on all triples $i, j, k$ in $O(\log z + \log m)$ steps and the OR in
$O(\log z)$ steps. Since we have the matrix multiplication algorithm of Pratt and Stockmeyer,
we need only prove that a RAM[$\uparrow,\downarrow$] can compute $\theta_d(A(i, k), B(k, j))$ in $O(\log m)$ time to
establish this bound. In our case, $m = z = Z(n)$.




Suppose R has operated $\log d$ times on $A$. Call the resulting matrix $A_d$. A 1 in bit
position $i$ of $A_d(f, g)$ indicates that gate $g$ is the $i$th ancestor of gate $f$ at distance $d$. We want
$\theta_d(A_d(f, g), A_d(g, h))$ to return a bit vector $A_{2d}(f, h)$ with a 1 in bit position $i$ if gate $h$ is
the $i$th ancestor of gate $f$ at distance $2d$, and 0 in bit position $i$ otherwise.

Assume $m$ is a power of 2. Let $x = 2^d$. Let $\alpha$ and $\beta$ be integers $x$ bits long: $\#\alpha =
\alpha_{x-1} \cdots \alpha_1 \alpha_0$ and $\#\beta = \beta_{x-1} \cdots \beta_1 \beta_0$. We define the function $\theta_d(\alpha, \beta)$ so that for each 1 in
bit position $a$ in $\#\alpha$ and each 1 in bit position $b$ in $\#\beta$, in the result $\gamma = \theta_d(\alpha, \beta)$, $\#\gamma$ has a 1 in
position $a \cdot 2^d + b$. If either $\alpha$ or $\beta$ is 0, then $\gamma$ is 0.
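The definition of $\theta_d$ can be checked against a direct, bit-by-bit reference. The sketch below is only an illustration of the definition, not the $O(\log m)$ shift-and-mask routine that R uses; the function name `theta` and the use of Python integers as bit vectors are assumptions made for this example.

```python
def theta(alpha: int, beta: int, d: int) -> int:
    """Reference version of theta_d: for every 1 at position a of alpha and
    every 1 at position b of beta, set position a * 2**d + b of the result."""
    x = 1 << d  # both operands are treated as x = 2**d bits long
    gamma = 0
    for a in range(x):
        if (alpha >> a) & 1:
            for b in range(x):
                if (beta >> b) & 1:
                    gamma |= 1 << (a * x + b)
    return gamma
```

For instance, with $d = 1$, a 1 in position 1 of $\alpha$ and a 1 in position 0 of $\beta$ yield a 1 in position $1 \cdot 2 + 0 = 2$ of the result.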

Before describing how R performs $\theta_d$, we define two functions SPREAD and FILL that
R will use to perform $\theta_d$. The function SPREAD$(\#\alpha, y)$ returns the bit vector $\#\alpha' =
\alpha_{x-1} 0 \cdots 0\, \alpha_{x-2} 0 \cdots 0\, \alpha_1 0 \cdots 0\, \alpha_0$ in which $\alpha_{i+1}$ is $y$ bit positions away from $\alpha_i$, separated
by 0's, for all $0 \le i \le x - 2$.

Lemma 9.1.3. For any $y$, R can perform SPREAD$(\#\alpha, y)$ in time $O(\log x)$, where $\#\alpha$ is $x$
bits long.

Proof. R performs SPREAD in $O(\log x)$ phases of mask and shift operations. Note that the
subscript of each bit in $\#\alpha$ specifies its position. In phase $i$, R uses $\mathit{mask}_i$ to mask away all
bits of $\#\alpha$ whose subscript has a 0 in the $(\log x - i)$th position. R then shifts the remainder of
the string.

In particular, R creates $\mathit{mask}_1$ in three steps: $\mathit{mask}_1 \leftarrow 1 \uparrow (x/2)$; $\mathit{mask}_1 \leftarrow \mathit{mask}_1 - 1$;
$\mathit{mask}_1 \leftarrow \mathit{mask}_1 \uparrow (x/2)$. Thus, $\#\mathit{mask}_1$ has 1's in its $x/2$ most significant positions and 0's
in the remaining $x/2$ bit positions. (We can readily divide by powers of 2 using the right
shift operation and multiply by powers of 2 using the left shift operation.) In general, $\#\mathit{mask}_i$
has a 1 in every position $p$ such that $p$ has a 1 in the $(\log x - i)$th bit position.

Let $\mathit{temp}_1 = \alpha$. We let $\mathit{temp}_i$ hold intermediate results as R spreads the bits in $\alpha$ to get
$\alpha'$.

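Given $\mathit{temp}_i$ and $\mathit{mask}_i$, R constructs $\mathit{temp}_{i+1}$ and $\mathit{mask}_{i+1}$ in phase $i$ as follows.

$\mathit{temp}_i(1) \leftarrow \mathit{temp}_i \wedge \mathit{mask}_i$  (* mask away bits of $\#\alpha$ whose subscript has 0 in the $(\log x - i)$th bit position *)
$\mathit{temp}_i(2) \leftarrow \mathit{temp}_i \oplus \mathit{temp}_i(1)$  (* the bits of $\#\alpha$ masked away in the previous step *)
$\mathit{temp}_i(1) \leftarrow \mathit{temp}_i(1) \uparrow ((x/2^i) \cdot (y-1))$  (* spread the unmasked bits *)
$\mathit{temp}_{i+1} \leftarrow \mathit{temp}_i(1) \vee \mathit{temp}_i(2)$  (* combine with masked bits *)
$\mathit{mask}_i(1) \leftarrow \mathit{mask}_i \downarrow (x/2^{i+1})$
$\mathit{mask}_i(2) \leftarrow \mathit{mask}_i(1) \wedge \mathit{mask}_i$
$\mathit{mask}_i(3) \leftarrow \mathit{mask}_i(1) \oplus \mathit{mask}_i(2)$  (* separate bits of $\mathit{mask}_i$ into two groups *)
$\mathit{mask}_i(4) \leftarrow \mathit{mask}_i(2) \uparrow [(x/2^i) \cdot (y-1) + (x/2^{i+1})]$
$\mathit{mask}_{i+1} \leftarrow \mathit{mask}_i(3) \vee \mathit{mask}_i(4)$

In $O(\log x)$ phases, each taking a constant amount of time, R performs
SPREAD$(\#\alpha, y)$. □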
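The mask-and-shift phases in the proof of Lemma 9.1.3 can be traced with ordinary integers standing in for registers. The following is a minimal sketch under that assumption: `<<` and `>>` play the roles of the left and right shift instructions, the function name `spread` is hypothetical, and the mask update is skipped in the last phase (where $x/2^{i+1}$ is no longer an integer).

```python
def spread(alpha: int, x: int, y: int) -> int:
    """Move bit i of the x-bit value alpha to position i * y.
    Assumes x is a power of 2 and y >= 1."""
    log_x = x.bit_length() - 1
    half = x // 2
    mask = ((1 << half) - 1) << half      # 1's in the x/2 most significant positions
    temp = alpha
    for i in range(1, log_x + 1):
        width = x >> i                    # x / 2**i
        moved = temp & mask               # bits whose subscript has a 1 in position log x - i
        kept = temp ^ moved               # the remaining bits stay where they are
        temp = (moved << (width * (y - 1))) | kept
        if i < log_x:                     # prepare the mask for the next phase
            nxt = width >> 1              # x / 2**(i+1)
            m1 = mask >> nxt
            m2 = m1 & mask
            m3 = m1 ^ m2
            m4 = m2 << (width * (y - 1) + nxt)
            mask = m3 | m4
    return temp
```

With $x = 4$ and $y = 2$, the bits at positions 0, 1, 3 of `0b1011` end up at positions 0, 2, 6.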

Let $\alpha^s$ = SPREAD$(\#\alpha, y)$. The function FILL$(\alpha^s, y)$ returns the value $\alpha^f$, where $\#\alpha^f =
\alpha_{x-1}\alpha_{x-1} \cdots \alpha_{x-1} \cdots \alpha_1\alpha_1 \cdots \alpha_1\, \alpha_0\alpha_0 \cdots \alpha_0$, in which positions $iy, \ldots, (i+1)y-1$
have value $\alpha_i$. Assume $y$ is a power of 2. (Note: One may think of $\alpha^s$ as $\alpha$ spread out and
of $\alpha^f$ as $\alpha^s$ filled in.)

Lemma 9.1.4. R can perform FILL$(\alpha^s, y)$ in time $O(\log y)$.

Proof. R performs FILL$(\alpha^s, y)$ in $O(\log y)$ phases of shift and OR operations. Let $\alpha^f(i)$
represent the outcome of phase $i$.

$\mathit{temp} \leftarrow \alpha^f(i) \uparrow 2^{i-1}$
$\alpha^f(i+1) \leftarrow \alpha^f(i) \vee \mathit{temp}$

Then FILL$(\alpha^s, y) = \alpha^f = \alpha^f(\log y)$. Each phase takes constant time, so R performs
FILL$(\alpha^s, y)$ in $O(\log y)$ time. □
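The doubling in the proof of Lemma 9.1.4 can be sketched in a few lines. As before, Python integers stand in for registers and the name `fill` is an assumption of this example; the loop runs $\log y$ shift-and-OR phases with shift distances $1, 2, 4, \ldots, y/2$.

```python
def fill(alpha_s: int, y: int) -> int:
    """Copy each bit at position i * y into positions i * y, ..., (i + 1) * y - 1.
    Assumes y is a power of 2 and alpha_s has 1's only at multiples of y."""
    log_y = y.bit_length() - 1
    result = alpha_s
    for i in range(1, log_y + 1):
        temp = result << (1 << (i - 1))   # shift left by 2**(i-1)
        result |= temp
    return result
```

Each phase doubles the length of every run of 1's, so after $\log y$ phases every run is $y$ bits long.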

Now we describe how a RAM[$\uparrow,\downarrow$] R computes $\theta_d(\alpha, \beta)$ in $O(\log x) = O(\log m)$ steps.
R first computes $\alpha^s$ = SPREAD$(\#\alpha, 2^d)$ in $O(\log x)$ time. Each 1 in position $i$ in $\#\alpha$
produces a 1 in position $i \cdot 2^d$ of $\#\alpha^s$. R concatenates $x \cdot 2^d$ copies, each $m$ bits long, of $\#\alpha^s$
in $O(\log x)$ time. Let $\mathit{squid}$ denote the value of this concatenation. R then computes $\beta^s$ =
SPREAD$(\#\beta, m)$ in $O(\log x)$ time; this aligns a bit of $\#\beta$ with each copy of $\#\alpha^s$ in $\#\mathit{squid}$.

R computes $\beta^f$ = FILL$(\beta^s, m)$. R then performs $\mathit{squid} \leftarrow \mathit{squid} \wedge \beta^f$, which blocks out
each copy of $\#\alpha^s$ in $\#\mathit{squid}$ that corresponds to a 0 in $\#\beta$.

Next we explain how for each nonzero bit $\beta_j$ of $\#\beta$, R shifts the $j$th copy of $\#\alpha^s$ to the
left by $j$ bits in $O(\log x)$ phases of mask and shift operations. In phase $i$, R masks away all
copies of $\#\alpha^s$ corresponding to nonzero bits $\beta_j$ for which the $(\log x - i)$th position of $\#j$ is 0,
then shifts the remainder of the vector.

R creates $\mathit{mask}_1$ as follows: $\mathit{mask}_1 \leftarrow 1 \uparrow (mx/2)$; $\mathit{mask}_1 \leftarrow \mathit{mask}_1 - 1$;
$\mathit{mask}_1 \leftarrow \mathit{mask}_1 \uparrow (mx/2)$. Thus, $\#\mathit{mask}_1$ has 1's in its $mx/2$ most significant positions and
0's in the remaining $mx/2$ bit positions.

Let $\mathit{temp}_1 = \mathit{squid}$.

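Given $\mathit{temp}_i$ and $\mathit{mask}_i$, R constructs $\mathit{temp}_{i+1}$ and $\mathit{mask}_{i+1}$ in phase $i$ as follows.

$\mathit{temp}_i(1) \leftarrow \mathit{temp}_i \wedge \mathit{mask}_i$
$\mathit{temp}_i(2) \leftarrow \mathit{temp}_i \oplus \mathit{temp}_i(1)$  (* split bits of $\mathit{temp}_i$ into two sets *)
$\mathit{temp}_i(1) \leftarrow \mathit{temp}_i(1) \uparrow (2^{\log x - i})$  (* shift one set *)
$\mathit{temp}_{i+1} \leftarrow \mathit{temp}_i(1) \vee \mathit{temp}_i(2)$  (* recombine *)
$\mathit{mask}_i(1) \leftarrow \mathit{mask}_i \downarrow (m/2^{i+1})$
$\mathit{mask}_i(2) \leftarrow \mathit{mask}_i(1) \wedge \mathit{mask}_i$
$\mathit{mask}_i(3) \leftarrow \mathit{mask}_i(1) \oplus \mathit{mask}_i(2)$  (* split bits of $\mathit{mask}_i$ into two sets *)
$\mathit{mask}_i(4) \leftarrow \mathit{mask}_i(2) \uparrow (x/2^i)$  (* shift one set *)
$\mathit{mask}_{i+1} \leftarrow \mathit{mask}_i(3) \vee \mathit{mask}_i(4)$  (* recombine *)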

Let a block of size $m$ of $\#\gamma = \gamma_k \cdots \gamma_0$ be a set of bits $\gamma_{(j+1)m-1} \cdots \gamma_{jm}$. Finally, R ORs
together all blocks of size $m$ in $O(\log m)$ steps. (Formerly, each block of size $m$ was a copy
of $\#\alpha^s$; now some have been shifted.) The resulting bit vector is $\theta_d(\alpha, \beta)$. R has computed
$\theta_d(\alpha, \beta)$ in $O(\log m)$ steps.

Recall that we defined $H_d(A, B) = G$, where

$$G(i, j) = \bigvee_k \theta_d\bigl(A(i, k), B(k, j)\bigr).$$

To perform $H_d(A, B)$ on two $z \times z$ matrices $A$ and $B$ in $O(\log z + \log m)$ time, we must show
that we can perform $\theta_d(A(i, k), B(k, j))$ in $O(\log z + \log m)$ time when the matrices $A$ and $B$
are given as bit vectors. We simply allow enough space between elements in the bit vectors
$A$ and $B$ so that operations on adjacent pairs do not interfere with each other, and we can
generate all masks and perform all operations in $O(\log z + \log m)$ time.

Given the $Z(n) \times Z(n)$ ancestor matrix $A_1 = A$ with entries $Z(n)$ bits long, R computes
$A_2 = H_1(A_1, A_1)$, then $A_{2j} = H_j(A_j, A_j)$, for each $j = 1, 2, 4, \ldots, \log Z(n) - 1$. Let $G =
A_{\log Z(n)}$. Each $H_d$ operation takes time $O(\log Z(n))$ to compute, and R executes $H_d$ for
$\log\log Z(n)$ values of $d$. Hence, R computes $G$ in $O(\log Z(n) \log\log Z(n))$ time.
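One doubling step of Stage 2 can be illustrated on a tiny circuit. The sketch below is a sequential reference only (triple loop, bit-by-bit $\theta_d$), so it says nothing about the $O(\log Z(n))$ running time on a RAM with shifts. The helper names `theta`, `h`, and `ancestor_matrix`, the dictionary encoding of the circuit, and the seven-gate example are all assumptions made for illustration.

```python
def theta(alpha: int, beta: int, d: int) -> int:
    # Positions a * 2**d + b are distinct for distinct (a, b), so the sum acts as an OR.
    x = 1 << d
    return sum(1 << (a * x + b)
               for a in range(x) if (alpha >> a) & 1
               for b in range(x) if (beta >> b) & 1)

def h(A, B, d):
    """H_d(A, B): G(i, j) is the OR over k of theta_d(A(i, k), B(k, j))."""
    z = len(A)
    G = [[0] * z for _ in range(z)]
    for i in range(z):
        for j in range(z):
            for k in range(z):
                G[i][j] |= theta(A[i][k], B[k][j], d)
    return G

def ancestor_matrix(inputs, z):
    """A(g, h) has bit 0 (bit 1) set if gate h is the left (right) input of gate g."""
    A = [[0] * z for _ in range(z)]
    for g, (left, right) in inputs.items():
        A[g][left] |= 0b01
        A[g][right] |= 0b10
    return A

# Hypothetical circuit: gate 0 reads gates 1 and 2; gate 1 reads 3 and 4; gate 2 reads 5 and 6.
A1 = ancestor_matrix({0: (1, 2), 1: (3, 4), 2: (5, 6)}, 7)
A2 = h(A1, A1, 1)   # ancestors at distance 2
```

In `A2`, row 0 marks gates 3, 4, 5, 6 as the 0th, 1st, 2nd, and 3rd ancestors of gate 0 at distance 2.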

Stage 3: Extraction of slices and circuit trees. We partition $F_n$ into $D(n) / \log Z(n)$ slices
of depth $\log Z(n)$ each. Let $\nu = D(n) / \log Z(n)$. The $j$th slice comprises levels $j \log Z(n)$,
$\ldots$, $(j+1) \log Z(n) - 1$, for $0 \le j \le \nu - 1$. Let $\Sigma(j)$ denote the $j$th slice.

R will extract circuit tree descriptions of each slice from matrix $G$, starting with $\Sigma(\nu-1)$.

We introduce a simple procedure COLLAPSE, which R will use to extract information
from matrix $G$. Procedure COLLAPSE$(\alpha, z)$ takes as input the value $\alpha$, where $\#\alpha =
\alpha_{z^2-1} \cdots \alpha_1 \alpha_0$, and returns the value $\beta$, where $\#\beta = \beta_{z^2-1} \cdots \beta_0$, and bits
$\beta_{kz} = \bigvee_{i=0}^{z-1} \alpha_{kz+i}$, for $0 \le k \le z - 1$, and $\beta_i = 0$ if $i \ne kz$.

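Lemma 9.1.5. R can perform COLLAPSE$(\alpha, z)$ in $O(\log z)$ time.

Proof. Let $\mathit{temp}_0 = \alpha$. For $i = 0, \ldots, \log z - 1$, R performs the following steps:

$\mathit{temp}_i(1) \leftarrow \mathit{temp}_i \downarrow 2^i$
$\mathit{temp}_{i+1} \leftarrow \mathit{temp}_i \vee \mathit{temp}_i(1)$.

Let $\psi$ denote $\mathit{temp}_{\log z}$. At this point $\#\psi = \psi_{z^2-1} \cdots \psi_1 \psi_0$, where $\psi_i = \bigvee_{j=0}^{z-1} \alpha_{i+j}$. Now we
mask away all bits $\psi_i$ for $i \ne kz$. This takes $O(\log z)$ time, and the result is $\beta$. □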
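The two steps of the proof (OR-ing each group of $z$ bits down into its lowest position, then masking) can be sketched as follows. Python integers again stand in for registers, the name `collapse` is hypothetical, and the way the mask of positions $0, z, 2z, \ldots$ is built by doubling is an assumption of this sketch; the proof only states that the masking takes $O(\log z)$ time.

```python
def collapse(alpha: int, z: int) -> int:
    """For a z*z-bit value alpha, set bit k*z to the OR of bits k*z, ..., k*z + z - 1
    and clear every other bit. Assumes z is a power of 2."""
    log_z = z.bit_length() - 1
    temp = alpha
    for i in range(log_z):
        temp |= temp >> (1 << i)      # after the loop, bit p holds the OR of bits p, ..., p + z - 1
    mask = 1
    for i in range(log_z):
        mask |= mask << (z << i)      # doubling: 1's at positions 0, z, 2z, ..., (z - 1) z
    return temp & mask
```

With $z = 4$, the input `0b1000_0001_0000_0100` has nonzero groups 0, 2, and 3, so the result has 1's at positions 0, 8, and 12.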

Matrix $G$ is stored in a single register in row major order. We view the contents of this
register as both a matrix of discrete elements and as a single bit string. We call the portion
of a matrix comprising one row a box. We call the portion of a box containing one element
of a row a slot.

To extract $CT(\Sigma(i))$, R will isolate the portion of matrix $G$ that describes the circuit
trees for each output from slice $\Sigma(i)$. We will call this matrix a slice matrix. Let $S(i)$ denote
the slice matrix for $\Sigma(i)$. For each gate $g$ at the output of $\Sigma(i)$, $S(i)$ specifies each ancestor of
gate $g$ at the top of $\Sigma(i)$.

Let $\mathit{out}$ denote the name of the output gate of $F_n$. Nonzero entries in row $\mathit{out}$ of $G$
correspond to the ancestors of gate $\mathit{out}$ at distance $\log Z(n)$; that is, the gates at the boundary
between $\Sigma(\nu-1)$ and $\Sigma(\nu-2)$. To extract $CT(\Sigma(\nu-1))$, R masks away all but row $\mathit{out}$ of $G$.
Let $S(\nu-1)$ denote this value.

In general, assume that we have $S(i+1)$, and we want to compute $S(i)$. First, R
computes the OR of all boxes of $S(i+1)$. Let $\eta(i)$ denote this value. R computes $\alpha(i)$ =
COLLAPSE$(\eta(i), Z(n))$ in $O(\log Z(n))$ time. Bit $jZ^2(n) + kZ(n)$, $0 \le j, k \le Z(n) - 1$, of
$\#\alpha(i)$ is 0 (1) if slot $k$ of box $j$ of $S(i+1)$ contains all 0's (at least one 1). Let $\alpha_b(i)$ denote bit
$b$ of $\#\alpha(i)$. We will use $\alpha(i)$ to select the rows of $G$ that correspond to nonzero slots of
$S(i+1)$. Next, R computes $\sigma(i)$ = SPREAD$(\#\alpha(i), Z(n))$ in $O(\log Z(n))$ steps. This leaves
bit $\alpha_{kZ(n)}(i)$ at position $kZ^2(n)$ of $\#\sigma(i)$; that is, it aligns the bit of $\#\alpha(i)$ indicating whether
or not slot $m$ contains a 1 with box $m$ of $G$. Now, R computes $\theta(i)$ = FILL$(\sigma(i), Z^2(n))$, and
$\#\theta(i)$ has 1's in the boxes of $G$ that correspond to ancestors of $\mathit{out}$ at distance $\log Z(n)$.

R computes $S(i) \leftarrow \theta(i) \wedge G$. Thus, the boxes of $G$ that correspond to slots of $S(i+1)$
that contain all 0's are masked away in $S(i)$. Each nonzero slot of $S(i)$ indicates a gate at the
top boundary of $\Sigma(i)$.

Stage 4: Running the slices on the input. At this point, for $0 \le i \le \nu - 1$, R has computed
$S(i)$. Each $S(i)$ contains a description of $CT(\Sigma(i))$. (Note: $CT(\Sigma(i))$ is a collection of
circuit trees, one for each output of $\Sigma(i)$.) In Stage 4, R runs each $CT(\Sigma(i))$ in sequence. R
begins by manipulating the input $\omega$ to be in the form necessary to run on $CT(\Sigma(0))$. The
input $\omega$ to $F_n$ is $2n$ bits long ($n$ input bits and their complements).

We describe how R runs the circuit by slices. We must take the output $\psi(i)$ from $\Sigma(i)$
and convert it into the form needed for the input to $CT(\Sigma(i+1))$. We let $\omega(i)$ denote this
input.

We must initially consider input $\omega$ as a special case. Without loss of generality assume
that the input gates are numbered $1, \ldots, 2n$. We view $\omega$ as padded with 0's to be $Z(n)$ bits
long. R computes $\mu(0)$ = SPREAD$(\#\omega, Z(n))$. The bits of $\omega$ are $Z(n)$ bits apart in $\mu(0)$, one
per slot in a single box.

We now define a function COMPRESS, the inverse of SPREAD. The function
COMPRESS$(\beta, y)$, where $\#\beta = \beta_{xy-1} \cdots \beta_1 \beta_0$, returns the value $\alpha$, where $\#\alpha =
\beta_{(x-1)y}\, \beta_{(x-2)y} \cdots \beta_y\, \beta_0$, in $O(\log x)$ time. In general, we have $S(i)$ and $\psi(i-1)$. The output
$\psi(i-1)$ from a slice $\Sigma(i-1)$ is in the form of isolated bits, one for each box corresponding to
an output from $\Sigma(i-1)$. R computes $\mu(i-1)$ = COMPRESS$(\psi(i-1), Z(n))$, then builds
$\mu^f(i-1)$ = FILL$(\mu(i-1), Z(n))$ in $O(\log Z(n))$ time. The result $\mu(i-1)$ has $Z(n)$ output bits
from $\psi(i-1)$. These are $Z(n)$ bits apart, one per slot in a single box. Now R concatenates
$Z(n)$ copies of $\mu^f(i-1)$; call this $\mu^{ff}(i-1)$. Each element is $Z^2(n)$ bits long, the length of a
box. R computes $S(i)' = S(i) \wedge \mu^{ff}(i-1)$; hence, $\#S(i)'$ has a 1 in position
$jZ^2(n) + kZ(n) + l$ if gate $j$ is at the bottom of $\Sigma(i)$, gate $k$ is the $l$th ancestor of $j$ at the top
of $\Sigma(i)$, and the input to gate $k$ is a 1. R ORs all slots in each box together in $O(\log Z(n))$
steps, producing bit vector $\omega(i)$. By our construction, $\omega(i)$ is the input to $CT(\Sigma(i))$. Recall
that $CT(\Sigma(i))$ consists of alternating layers of AND and OR gates. We run $CT(\Sigma(i))$ on
input $\omega(i)$ in $O(\log Z(n))$ steps. Let $\mathit{temp}_0 = \omega(i)$. R performs the following for $m = 0, 1,
\ldots, \log Z(n) - 1$.

$\mathit{temp}_m(1) \leftarrow \mathit{temp}_m \downarrow 2^m$
$\mathit{temp}_{m+1} \leftarrow \mathit{temp}_m \wedge \mathit{temp}_m(1)$, if $m$ is even
$\mathit{temp}_{m+1} \leftarrow \mathit{temp}_m \vee \mathit{temp}_m(1)$, if $m$ is odd

Let $\psi(i) = \mathit{temp}_{\log Z(n)}$. At most $O(Z(n))$ bits of $\#\psi(i)$ are the output values of $CT(\Sigma(i))$ on
input $\omega(i)$.
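The shift-and-combine loop above evaluates a complete tree of alternating AND and OR layers in one pass per layer. The sketch below runs that loop on a single tree whose leaves occupy the low-order bits of one integer; the function name `run_alternating_tree`, the restriction to one tree, and reading the root value from bit 0 are assumptions of this illustration (in the text, many trees are evaluated side by side in one register).

```python
def run_alternating_tree(leaves: int, depth: int) -> int:
    """Evaluate a complete binary tree with 2**depth leaves given as bits of `leaves`.
    The layer above the leaves is AND, the next is OR, and so on; the root ends at bit 0."""
    temp = leaves
    for m in range(depth):
        shifted = temp >> (1 << m)            # bring each right sibling next to its left sibling
        if m % 2 == 0:
            temp = temp & shifted             # AND layer
        else:
            temp = temp | shifted             # OR layer
    return temp & 1
```

For four leaves $b_3 b_2 b_1 b_0$ and depth 2, the value returned is $(b_0 \wedge b_1) \vee (b_2 \wedge b_3)$.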

It takes time $O(\log Z(n))$ to fix the output from one slice to go into another slice and
time $O(\log Z(n))$ to run a slice. Since there are $D(n) / \log Z(n)$ slices, it takes time
$O(D(n))$ to run a circuit on the input, given the distance $\log Z(n)$ ancestor matrix $G$.

Theorem 9.1. Let $C = \{C_1, C_2, \ldots\}$ be a family of VM-uniform, bounded fan-in circuits
of size $Z(n)$ and depth $D(n)$ recognizing language $L$. There exists a RAM[$\uparrow,\downarrow$] R that
recognizes $L$ in time $O(D(n) + \log Z(n) \log\log Z(n))$.

Proof. We construct R by the method described above. For fixed $n$, R simulates $C_n$ via $F_n$
in $O(\log Z(n) \log\log Z(n))$ time to create matrix $G$, then $O(D(n))$ time to run $F_n$ on input
$\omega$, given $G$. Thus, the overall time is $O(D(n) + \log Z(n) \log\log Z(n))$ steps. □

9.3. Simulation of PRAM[$\uparrow,\downarrow$] by RAM[$\uparrow,\downarrow$]

Using Theorem 9.1, we now simulate a PRAM[$\uparrow,\downarrow$] by a RAM[$\uparrow,\downarrow$]. Recall that we
simulated a PRAM[$\uparrow,\downarrow$] by a family of log-space uniform unbounded fan-in circuits $UC$
according to the simulation by Stockmeyer and Vishkin (1984) (Lemma 6.2.2), then
simulated this by a family of log-space uniform bounded fan-in circuits $BC$ (Lemma 6.2.3).
In this manner, we showed that a family $BC$ of bounded fan-in circuits of depth $O(T^3(n))$
and size $O(T^8(n)\,4^{T^2(n)})$ can simulate time $T(n)$ on a PRAM[$\uparrow,\downarrow$]. We need only establish
that $BC$ is VM-uniform to give an $O(T^3(n))$ time simulation of a PRAM[$\uparrow,\downarrow$] by a
RAM[$\uparrow,\downarrow$].


Definition.  A  PRAM  is  uniform  if  all  processors  execute  the  same  program. 

Lemma 9.2.1. Let $C = \{C_1, C_2, \ldots\}$ be the family of unbounded fan-in circuits described by
Stockmeyer and Vishkin (1984) that simulates a uniform PRAM (Theorem 2.1). $C$ is
VM-uniform.

Proof. Stockmeyer and Vishkin present the simulation of a nonuniform PRAM by a
nonuniform family of circuits. Since we study a uniform PRAM, the program size is
constant, and the simulating family of circuits is log-space uniform.

Fix a uniform PRAM $Y$ and an input size $n$. The simulating circuit $C_n$ comprises $T(n)$
identical time slices. Each time slice corresponds to a time step of $Y$. Each time slice
comprises $P(n)$ cartons of gates, one for each processor, and a block of gates,
[Update-Common], handling updates to common memory. Each carton comprises 13 blocks of gates
handling various functions as indicated by their names: [Compute-Operands], [Add], [Sub],
[Local-Read], [Common-Read], [=-Compare], [<-Compare], [Compute-Address-of-Result],
[Select-Result], [Update-Instruction-Counter], [Local-Change?], [Common-Change?], and
[Update-w-Bits-of-Local-Triples]. The size of each time slice of $C_n$ is
$O(P(n)[T(n)(n+T(n)) + (n+T(n))^3 + (n+T(n))(n+P(n)T(n))])$, and the total size of $C_n$ is
$T(n)$ times this amount.

The  general  form  of  a  gate  name  is  specified  in  Figure  9.2. 

Let $Z(n)$ denote the size of $C_n$. It is clear from the description of the blocks given by
Stockmeyer and Vishkin that each block is VM-uniform and that the interconnections
between blocks are regular. Thus, to prove that $C$ is VM-uniform, we present an algorithm
that a RAM[$\uparrow,\downarrow$] R can run to test the connectivity of all pairs of gates in $O(\log Z(n))$ time.

[Figure 9.2 shows the fields of a gate name, left to right: time step, processor number,
block number, gate name within block. Annotation in the figure: "specifies carton".]

Figure 9.2. Gate name in $C_n$


Let $g$ denote a gate name. Let slot A denote the portion of $\#g$ specifying the time step.
Let slot B denote the portion of $\#g$ specifying the processor number. Let slot C denote the
portion of $\#g$ specifying the block number. Let slot D denote the portion of $\#g$ specifying
the gate name within the block. Let $\mathit{scon}_i(g)$ denote the contents of slot $i$ of $\#g$.

The input $I$ is the concatenation of all pairs $(g, h)$, where $g, h \in \{0, 1, \ldots, Z(n)^{O(1)}\}$.
Let the portion of that register holding the $j$th pair be called $\mathit{pair}(j)$. R initially builds four
masks: $\mathit{mask}_A$, $\mathit{mask}_B$, $\mathit{mask}_C$, and $\mathit{mask}_D$ in $O(\log Z(n))$ time, such that $\#\mathit{mask}_i$ has 1's in
slot $i$ of every pair.

In the algorithm below, R compares parts of $\#g$ and $\#h$ for all pairs $(g, h)$
simultaneously. R separates the pairs for which the comparison is true from the pairs for
which the comparison is false by building an appropriate mask in time $O(\log Z(n))$. For all
pairs $(g, h)$:

1. Using $\mathit{mask}_A$, test whether $\mathit{scon}_A(g) = \mathit{scon}_A(h)$.

   Mask away the unequal pairs in $I$. Call the resulting value $I_A$. Specifically, $\#I_A$
   comprises pairs $(\#g, \#h)$ for which $\mathit{scon}_A(g) = \mathit{scon}_A(h)$, and 0's at the positions of
   pairs $(g', h')$ for which $\mathit{scon}_A(g') \ne \mathit{scon}_A(h')$.

   Mask away the equal pairs in $I$. Call the resulting value $I_{\bar A}$.

2. Test whether $\mathit{scon}_A(g) = 1 + \mathit{scon}_A(h)$ in $I_{\bar A}$. Mask away those pairs for which
   $\mathit{scon}_A(g) \ne 1 + \mathit{scon}_A(h)$. Call the resulting value $I_{A+}$. Mark the pairs $(g, h)$ for
   which $\mathit{scon}_A(g) \ne 1 + \mathit{scon}_A(h)$ with a 0 to indicate that $h$ is not an input to $g$.
   (Gate $h$ is neither in the same time slice as $g$ nor in the preceding time slice.)

3. Using $\mathit{mask}_B$, test whether $\mathit{scon}_B(g) = \mathit{scon}_B(h)$ in $I_A$.

   Mask away the unequal pairs. Call the resulting value $I_B$.

   Mask away the equal pairs. Call the resulting value $I_{\bar B}$.

4. Using $\mathit{mask}_C$, test whether $\mathit{scon}_C(g) = \mathit{scon}_C(h)$ in $I_B$.

   Mask away the unequal pairs. Call the resulting value $I_C$.

   Mask away the equal pairs. Call the resulting value $I_{\bar C}$.

5. Using $\mathit{mask}_D$ on $I_C$, isolate $\mathit{scon}_D(g)$ and generate the names of the leftmost
   and rightmost inputs to gate $g$.

6. Test whether $\mathit{scon}_D(h)$ in $I_C$ is contained in the range specified by the names of
   $g$'s leftmost and rightmost inputs.

   If so, then mark pair $(g, h)$ with a 1 to indicate that $h$ is an input to $g$.

   If not, then mark pair $(g, h)$ with a 0 to indicate that $h$ is not an input to $g$.

7. (slot C not equal) Using $\mathit{mask}_B$ on $I_{\bar C}$, test whether the block containing $h$ is an
   input to the block containing $g$.

   If not, then mark pair $(g, h)$ with a 0.

   Mask away the pairs for which the test is false. Call the resulting value $I_{BC}$.

8. Using $\mathit{mask}_D$ on $I_{BC}$, test whether $h$ is an input to $g$ between blocks. (That is,
   test whether $h$ is an input to $g$, where $g$ and $h$ are in different blocks.)

   If so, then mark pair $(g, h)$ with a 1.

   If not, then mark pair $(g, h)$ with a 0.

9. (slot B not equal) Using $\mathit{mask}_B$ on $I_{\bar B}$, test whether $g$ is in the block
   [Update-Common].

   If not, then mark pair $(g, h)$ with a 0. (Gates $g$ and $h$ are in cartons belonging to
   different processors.)

   If so, then test whether $h$ is in a block that feeds into [Update-Common]. If not,
   then mark pair $(g, h)$ with a 0. Mask away those pairs in $I_{\bar B}$ that fail the test. Call
   the resulting value $I_{BB}$.

10. ($\mathit{scon}_A(g) = 1 + \mathit{scon}_A(h)$) Using $\mathit{mask}_B$ on $I_{A+}$, test whether outputs from the
    block containing $h$ are inputs to the block containing $g$.

    If not, then mark pair $(g, h)$ with a 0.

    Mask away those pairs in $I_{A+}$ that fail the test. Call the resulting value $I_{BA+}$, and
    OR this with $I_{BB}$. Call the resulting value $I_{B+}$.

11. Using $\mathit{mask}_D$ on $I_{B+}$, test whether $h$ is an input to $g$ between blocks. (That is,
    test whether $h$ is an input to $g$, where $g$ and $h$ are in different blocks.)

    If so, then mark pair $(g, h)$ with a 1.

    If not, then mark pair $(g, h)$ with a 0.

Note that the algorithm has a constant number of steps, with no loops. Also note that each
step can be executed in time $O(\log Z(n))$. Thus, $C$ is VM-uniform. □

Let $BC' = \{BC'_1, BC'_2, \ldots\}$ be the family of bounded fan-in circuits that simulates the
family $C$ of unbounded fan-in circuits described by Stockmeyer and Vishkin (1984)
(Theorem 2.1). The depth of $BC'_n$ is $O(T(n) \log(P(n)T(n)))$, and the size is
$O(P(n)T(n)[T(n)(n+T(n)) + (n+T(n))^3 + (n+T(n))(n+P(n)T(n))])$.

Lemma 9.2.2. $BC'$ is VM-uniform.

Proof. Fix an input length $n$. By Lemma 9.2.1, $C$ is VM-uniform. The fan-in of any gate
in $C_n$ is at most $O(nP(n)(n + P(n)T(n)))$. We construct $BC'$ from $C$ by replacing each
gate of $C_n$ with fan-in $f$ by a tree of gates of depth $\log f$. Thus, each gate in $C_n$ can be
simulated by a tree of gates in $BC'_n$ of depth at most $O(\log P(n)T(n))$. Hence, $BC'$ is
VM-uniform. □

Theorem 9.2. For all $T(n) \ge \log n$ and $P(n) \le 2^{T(n)}$, PRAM-TIME$(T(n)) \subseteq$
RAM[$\uparrow,\downarrow$]-TIME$(T(n) \log P(n)T(n))$.

Proof. By Lemma 9.2.2, $BC'$, the family of bounded fan-in circuits that simulates a PRAM,
is VM-uniform. By Theorem 9.1, a RAM[$\uparrow,\downarrow$] can simulate $BC'$ in time
$O(T(n) \log P(n)T(n))$. □

Lemma 9.3.1. Let $UC = \{UC_1, UC_2, \ldots\}$ be the family of unbounded fan-in circuits
described in Lemma 6.2.2 that simulates a uniform PRAM[$\uparrow,\downarrow$]. $UC$ is VM-uniform.

Proof. By Lemma 9.2.1, $C$ is VM-uniform. $UC$ has the same form as $C$, except in the
blocks labeled [Update-Common], handling updates to common memory. We reduce the
inputs to the gates in this block because of restrictions on the processors that may
simultaneously write a cell. It is easy to compute the processors that may simultaneously
write a cell, so $UC$ is also VM-uniform. □

Lemma 9.3.2. Let $BC = \{BC_1, BC_2, \ldots\}$ be the family of bounded fan-in circuits described
in Lemma 6.2.3 that simulates a uniform PRAM[$\uparrow,\downarrow$]. $BC$ is VM-uniform.

Proof. Fix an input size $n$. $BC_n$ is constructed from $UC_n$ by replacing each gate with fan-in
$f$ by a tree of gates of depth $\log f$. A gate name in $BC_n$ is the concatenation of the
unbounded fan-in gate name in $UC_n$ and the name of the gate within the bounded fan-in tree
that replaces the unbounded fan-in gate (Figure 9.3). We prove VM-uniformity by the same
algorithm given in the proof of Lemma 9.2.1, with modifications to test slot E, the portion of
the gate name giving the gate name within the tree of depth $\log f$. By this algorithm, we see
that the family $BC$ of bounded fan-in circuits is VM-uniform since, by Lemma 9.3.1, $UC$ is
VM-uniform. □


[Figure 9.3 shows the fields of a gate name, left to right: gate name in $UC_n$, gate name in
tree of depth $\log f$.]

Figure 9.3. Gate name in $BC_n$

Theorem 9.3. For all $T(n) \ge \log n$, PRAM[$\uparrow,\downarrow$]-TIME$(T(n)) \subseteq$
RAM[$\uparrow,\downarrow$]-TIME$(T^3(n))$.

Proof. By Lemma 6.2.3 and Lemma 9.3.2, for each $n$, every language recognized by a
PRAM[$\uparrow,\downarrow$] in time $T(n)$ can be recognized by a VM-uniform, bounded fan-in circuit $BC_n$
of depth $O(T^3(n))$ and size $O(T^8(n)\,4^{T^2(n)})$. By Theorem 9.1, there exists a RAM[$\uparrow,\downarrow$]
running in time $O(T^3(n) + (\log(T^8(n)\,4^{T^2(n)}))(\log\log(T^8(n)\,4^{T^2(n)}))) = O(T^3(n))$ that
simulates $BC_n$. □
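The final equality in the proof can be checked term by term. The following worked arithmetic is a sketch that assumes only the depth and size bounds quoted in the proof and logarithms to base 2:

```latex
\begin{align*}
\log\bigl(T^8(n)\,4^{T^2(n)}\bigr) &= 8\log T(n) + 2T^2(n) = O(T^2(n)),\\
\log\log\bigl(T^8(n)\,4^{T^2(n)}\bigr) &= \log O(T^2(n)) = O(\log T(n)),\\
T^3(n) + O(T^2(n))\cdot O(\log T(n)) &= O\bigl(T^3(n) + T^2(n)\log T(n)\bigr) = O(T^3(n)).
\end{align*}
```

The depth term dominates: the cost of building the ancestor matrix is lower order than the cost of running the slices.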

Corollary 9.3.1. PRAM[$\uparrow,\downarrow$]-PTIME = RAM[$\uparrow,\downarrow$]-PTIME.

Combining Theorem 9.2 with Theorem 6.1 (PRAM[$\uparrow,\downarrow$]-TIME$(T(n)) \subseteq$
PRAM-TIME$(T^2(n))$) implies

PRAM[$\uparrow,\downarrow$]-TIME$(T(n)) \subseteq$ RAM[$\uparrow,\downarrow$]-TIME$(T^4(n))$.

The simulation of Theorem 9.3 is more efficient.
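The exponent obtained by composing the two theorems can be worked out as follows. This is a sketch that assumes the processor bound of Theorem 9.2 is applied to the basic PRAM running in time $T^2(n)$, so that $P(n) \le 2^{T^2(n)}$:

```latex
\begin{align*}
T^2(n)\,\log\bigl(P(n)\,T^2(n)\bigr)
  &\le T^2(n)\,\bigl(T^2(n) + 2\log T(n)\bigr) \\
  &= O(T^4(n)).
\end{align*}
```

Under that assumption the composed simulation costs $O(T^4(n))$, compared with $O(T^3(n))$ for the direct simulation of Theorem 9.3.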

9.4.  Simulation  of  MRAM-Uniform  Circuit  by  RAM[*] 

In this section, we adapt the simulation of a VM-uniform circuit by a RAM[$\uparrow,\downarrow$]
(Section 9.2) to the case of a simulation of an MRAM-uniform circuit by a RAM[*].

Theorem 9.4. Let $MC = \{MC_1, MC_2, \ldots\}$ be a family of MRAM-uniform, bounded fan-in
circuits of size $Z(n)$ and depth $D(n)$ recognizing language $L$. There exists a RAM[*] R
that recognizes $L$ in time $O(D(n) + \log Z(n) \log\log Z(n))$.

Proof. Without loss of generality, assume that R has two memories: $\mathit{mem}_1$ and $\mathit{mem}_2$. R
performs the simulation described in Section 9.2, using a precomputed table of shift values in
$\mathit{mem}_2$. To perform a left shift, such as $\mathit{temp} \leftarrow \mathit{temp} \uparrow j$, R performs $\mathit{temp} \leftarrow \mathit{temp} * 2^j$. To
perform a right shift by $j$ bits, R shifts all other values in $\mathit{mem}_1$ left by $j$ bits, then notes that
the rightmost $j$ bits of all registers are to be ignored (Hartmanis and Simon, 1974). This
takes constant time because, by reusing registers, R uses only a constant number of registers
in $\mathit{mem}_1$. In $O(\log Z(n))$ time, R computes the values $2^{Z(n)}$ and $2^{Z^2(n)}$, since $Z(n)$ and
$Z^2(n)$ are the basic shift distances. In the course of the computation, R will perform shifts
by $Z(n) \cdot 2^i$, $0 \le i \le \log Z(n)$, for each value of $i$. R computes the necessary shift value on
each iteration from the previous value.

Thus,  the  simulation  by  R  takes  the  same  amount  of  time  as  the  simulation  described  in 
Section  9.2:  O  (D  (n )  +  log  Z(n )  loglog  Z(n )).  □ 
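
The two shift tricks used in this proof can be illustrated with a minimal Python sketch. It
assumes that Python integers stand in for unbounded RAM registers; the names power_of_two,
left_shift, and ScaledRegisters are hypothetical and do not come from the thesis. The thesis
does not spell out how 2^{Z(n)} is computed, so square-and-multiply is assumed here as one way
to stay within O(log Z(n)) multiplications.

```python
def power_of_two(z):
    """Return (2**z, number of multiplications), using only multiplication.

    Square-and-multiply over the bits of z: O(log z) multiplications.
    """
    result, mults = 1, 0
    for bit in bin(z)[2:]:          # most significant bit first
        result = result * result    # squaring doubles the exponent
        mults += 1
        if bit == "1":
            result = result * 2     # adds one to the exponent
            mults += 1
    return result, mults


def left_shift(value, two_to_j):
    """Left shift by j bits, done as value * 2**j with 2**j taken from a table."""
    return value * two_to_j


class ScaledRegisters:
    """Constant-size register file in which right shifts never discard bits.

    The true content of a register is its stored value with the lowest
    `ignored` bits dropped.
    """

    def __init__(self, values):
        self.stored = dict(values)
        self.ignored = 0

    def right_shift(self, name, j, two_to_j):
        # Shift every *other* register left by j, then ignore j more low bits.
        for other in self.stored:
            if other != name:
                self.stored[other] = self.stored[other] * two_to_j
        self.ignored += j

    def true_value(self, name):
        # Python's >> is used only to check the bookkeeping, not by the RAM[*].
        return self.stored[name] >> self.ignored
```

Because the register file has constant size, each simulated right shift costs a constant
number of multiplications, which matches the constant-time claim in the proof.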

9.5.  Simulation  of  PRAM[*]  by  RAM[*] 

In  this  section,  we  simulate  a  PRAM[*]  by  an  MRAM-uniform,  bounded  fan-in  circuit 
family,  then  simulate  this  circuit  family  by  a  RAM[*].  We  also  simulate  a  basic  PRAM  by  a 
RAM[*]. 

Lemma 9.5.1. Let C = {C_1, C_2, ...} be the family of unbounded fan-in circuits described by
Stockmeyer and Vishkin (1984) that simulates a uniform PRAM (Theorem 2.1). C is
MRAM-uniform.

Proof.  The  lemma  follows  by  the  proof  of  Lemma  9.2.1.  □ 

Let PC = {PC_1, PC_2, ...} be the family of bounded fan-in circuits that simulates the
family C of unbounded fan-in circuits described by Stockmeyer and Vishkin (1984). For a
fixed input size n, the depth of PC_n is O(T(n) log P(n)T(n)) and the size is
O(P(n)T(n)[T(n)(n+T(n)) + (n+T(n))^3 + (n+T(n))(n+P(n)T(n))]).


Lemma 9.5.2. PC is MRAM-uniform.

Proof. By Lemma 9.5.1, C is MRAM-uniform. By the proof given for Lemma 9.3.2, PC is
MRAM-uniform. □

Theorem 9.5. For all T(n) ≥ log n and P(n) ≤ 2^{T(n)}, PRAM-TIME(T(n)) ⊆
RAM[*]-TIME(T(n) log P(n)T(n)).

Proof. By Lemma 9.5.2, PC, the family of bounded fan-in circuits that simulates a PRAM,
is MRAM-uniform. By Theorem 9.4, a RAM[*] can simulate PC in time
O(T(n) log P(n)T(n)). □

Let BC denote the family of bounded fan-in circuits described in the proof of Lemma
4.2.2 that simulates a PRAM[*] in depth O(T^2(n)) and size O(n^2 T^2(n) 8^{T(n)} log T(n)).

We construct a family of bounded fan-in circuits BC' from BC. Fix an input size n. The
circuit BC'_n is exactly the same as the circuit BC_n except that BC'_n uses a different
multiplication block for reasons of MRAM-uniformity. Insert carry-save multiplication
blocks in BC'_n. Each block has depth O(T(n)) and size O(n^2 4^{T(n)}). Thus, BC'_n has depth
O(T^2(n)) and size O(n^2 T^2(n) 8^{T(n)} log T(n)).

Lemma 9.6.1. For each n, every language recognized by a PRAM[*] R in time T(n) with
P(n) processors can be recognized by the bounded fan-in circuit of depth O(T^2(n)) and
size O(n^2 T^2(n) 8^{T(n)} log T(n)).

Proof. The proof is similar to that given for Lemma 4.2.2. □

Lemma  9.6.2.  BC  is  MRAM-uniform. 

Proof. By Lemma 9.5.1, C is MRAM-uniform. By a proof like that given for Lemma 9.3.2,
BC is MRAM-uniform. □


Lemma 9.6.3. For each n, every language recognized by a PRAM[*] Y in time T(n) with
P(n) processors can be recognized by an MRAM-uniform, bounded fan-in circuit BC_n of
depth O(T^2(n)) and size O(T^{10}(n) P^4(n) 16^{T(n)}).

Proof. Fix an input length n. By Lemma 9.6.2, a bounded fan-in, MRAM-uniform circuit
BC_n of depth O(T^2(n)) and size O(T^{10}(n) P^4(n) 16^{T(n)}) can simulate Y. □

Theorem 9.6. For all T(n) ≥ log n, PRAM[*]-TIME(T(n)) ⊆ RAM[*]-TIME(T^2(n)).

Proof. By Lemma 9.6.3, a PRAM[*] running in time T(n) with P(n) processors can be
simulated by a bounded fan-in, MRAM-uniform circuit BC_n of depth O(T^2(n)) and size
O(T^{10}(n) P^4(n) 16^{T(n)}). By Theorem 9.4, a RAM[*] can simulate BC_n in time O(T^2(n)). □

9.6. Simulation of PRAM[*,÷] by RAM[*,÷]

In this section, we simulate a PRAM[*,÷] by a RAM[*,÷].

Theorem 9.7. For all T(n) ≥ log n, PRAM[*,÷]-TIME(T(n)) ⊆
RAM[*,÷]-TIME(T^3(n)).

Proof. By the proof of Theorem 5.1, a PRAM Z can simulate a PRAM[*,÷] Y in time O(T^2(n)) with
O(P^2(n) T^2(n) log T(n) n^2 4^{T(n)}) processors. By Theorem 9.5, a RAM[*], hence a
RAM[*,÷], can simulate Z in time O(T^3(n)). □

Note that the family DC of bounded fan-in circuits in Lemma 5.2.1 simulates the
PRAM[*,÷] in depth O(T^2(n) log T(n)). We expect to show that DC is MRAM-uniform by
proving that the division circuit of Shankar and Ramachandran (1987) is MRAM-uniform.
This would lead to an O(T^2(n) log T(n)) time simulation of a PRAM[*,÷] by a RAM[*,÷].
We are currently working on this problem.

As we noted in Chapter 5, a time-bounded RAM[÷] is much weaker than a time-bounded
PRAM[÷]. Therefore, a simulation of a PRAM[÷] by a RAM[÷] would be highly
inefficient.


Chapter 10. Alternatives

The  PRAM  is  a  flexible  model,  and  many  researchers  have  varied  particular  aspects  of 
the  model.  Variations  have  arisen  in  whether  to  allow  concurrent  writes  and,  if  so,  in  the 
rules  governing  concurrent  writes,  in  whether  to  allow  concurrent  reads,  in  the  amount  of 
local  memory  allotted  to  each  processor,  in  the  use  of  shared  memory,  and  in  the  mechanism 
for  processor  activation.  Indeed,  the  focus  of  this  thesis  is  relating  various  instruction  sets  in 
the  PRAM  model. 

In  this  chapter,  we  discuss  some  of  these  variations  and  study  their  effects  on  our 
results. 

•  Write  conflict  resolution 

We  allow  our  PRAM  concurrent  read  and  concurrent  write  (CRCW)  ability,  resolving 
write  conflicts  by  giving  priority  to  the  lowest  numbered  processor  attempting  to  write.  Fich 
et  al.  (1985)  called  this  model  the  PRIORITY  model.  A  number  of  other  conflict  resolution 
schemes  exist:  COMMON,  in  which  all  processors  attempting  to  write  must  write  the  same 
value;  ARBITRARY,  in  which  an  arbitrary  processor  succeeds  in  its  write  attempt; 
COLLISION,  in  which  a  special  collision  symbol  appears  in  a  cell  if  two  or  more  processors 
attempt  to  write  that  cell  simultaneously;  and  TOLERANT,  in  which  the  contents  of  a  cell  do 
not  change  in  the  event  of  a  write  conflict.  PRIORITY  is  the  strongest  scheme.  For  a 
discussion  of  detailed  relationships  among  these  models,  see  Kucera  (1982),  Fich  et  al. 
(1985),  Li  and  Yesha  (1986),  Fich  et  al.  (1987),  Grolmusz  and  Ragde  (1987),  Fich  et  al. 
(1988a),  and  Fich  et  al.  (1988b). 


•  CRCW vs. EREW


We  may  restrict  the  model  by  disallowing  concurrent  writes,  giving  a  concurrent  read, 
exclusive  write  (CREW)  version,  or  we  may  further  restrict  the  model  by  disallowing 
concurrent  reads,  giving  an  exclusive  read,  exclusive  write  (EREW)  version.  For 
relationships  among  these  restrictions,  see  Eckstein  (1979),  Vishkin  (1983a),  Snir  (1985), 
Cook  et  al.  (1986),  Reischuk  (1987),  and  Parberry  and  Yuan  (1987). 

In the preceding chapters, we presented relations among CRCW PRAMs with various
instruction sets. We now prove similar results for EREW PRAMs, but with slightly higher
time bounds, by converting the simulations by bounded fan-in circuits to simulations by
EREW PRAMs.

Lemma 10.1.1. Let B = {B_1, B_2, ...} be a log-space uniform family of bounded fan-in
circuits of depth D(n) ≥ log n that accepts language L. There exists an EREW PRAM EP
that runs in time O(D(n)) and accepts language L.

Proof.  The  proof  is  given  in  Karp  and  Ramachandran  (1988).  □ 

First,  we  simulate  an  EREW  PRAM[*]. 

Theorem 10.1. EREW-PRAM[*]-TIME(T(n)) ⊆ EREW-PRAM-TIME(T^2(n)).

Proof.  The  theorem  is  true  by  Lemmas  4.2.2  and  10.1.1.  □ 

Next, we simulate an EREW PRAM[*,÷] and an EREW PRAM[÷].

Theorem 10.2. EREW-PRAM[*,÷]-TIME(T(n)) ⊆
EREW-PRAM-TIME(T^2(n) log T(n)).

Proof.  The  theorem  is  true  by  Lemmas  5.2.2  and  10.1.1.  □ 


Lemma 10.3.1. An EREW PRAM can compute the quotient of two x-bit operands in
O(log x loglog x) time.

Proof.  The  lemma  is  true  by  Lemmas  5.2.1  and  10.1.1.  □ 

Lemma  10.3.2  states  the  key  to  converting  CRCW  simulations  into  EREW  simulations. 

Lemma 10.3.2. Let CR be a CRCW PRAM running in T(n) time with P(n) processors, in
which at most Q(n) processors simultaneously read or write a single cell. An EREW PRAM ER
can simulate CR in O(T(n) log Q(n)) time.

Proof. A concurrent read or write by Q(n) processors of CR takes O(log Q(n)) time on ER
by the method described by Vishkin (1983a), when we replace Batcher's sort with the faster
parallel merge sort of Cole (1986). □
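
The idea behind this lemma can be sketched as follows. This is a sequential Python
illustration under stated assumptions, not the method of Vishkin (1983a) in detail: Python's
built-in sorted stands in for the parallel sort, the function name erew_concurrent_read is
hypothetical, and only the read case is shown. After sorting the requests by address, one
representative per address reads the cell, and the value then spreads along each run of equal
addresses by doubling, so that no position is read by more than one reader in a round.

```python
def erew_concurrent_read(memory, requests):
    """Serve requests[p] = address for each processor p without concurrent reads.

    Returns (values, rounds): values[p] is memory[requests[p]], and rounds is the
    number of doubling rounds used after sorting.
    """
    # Step 1: sort the requests by address (stand-in for a parallel sort).
    order = sorted(range(len(requests)), key=lambda p: requests[p])
    addr = [requests[p] for p in order]
    n = len(addr)

    # Step 2: the first request in each run of equal addresses reads the cell.
    value = [None] * n
    have = [False] * n
    for i in range(n):
        if i == 0 or addr[i] != addr[i - 1]:
            value[i], have[i] = memory[addr[i]], True

    # Step 3: spread each value along its run by doubling the copy distance.
    rounds, d = 0, 1
    while not all(have):
        new_value, new_have = value[:], have[:]
        readers = {}                      # source position -> number of readers
        for i in range(d, n):
            src = i - d
            if not have[i] and have[src] and addr[i] == addr[src]:
                new_value[i], new_have[i] = value[src], True
                readers[src] = readers.get(src, 0) + 1
        assert all(count == 1 for count in readers.values())  # exclusive reads
        value, have = new_value, new_have
        rounds, d = rounds + 1, 2 * d

    # Step 4: return each value to the processor that asked for it.
    result = [None] * n
    for i, p in enumerate(order):
        result[p] = value[i]
    return result, rounds
```

With at most Q requests for one cell, the doubling loop runs for about log Q rounds, which is
the per-step overhead stated in the lemma once the sort is also done in logarithmic time.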

We  must  modify  the  Associative  Memory  Lemma  to  apply  to  EREW  PRAMs. 

Lemma 10.3.3. (EREW Associative Memory Lemma) Let op ⊆ {*, ↑, ↓}. For all T(n) and
P(n), every language recognized with P(n) processors in time T(n) by an EREW
PRAM[op] ER can be recognized in time O(T(n) log(P(n)T(n))) by an EREW PRAM[op]
ER' that accesses only cells with addresses in 0, ..., O(P(n)T(n)).

Proof. At most P(n)T(n) processors simultaneously read or write a single cell in the
simulation presented in Chapter 3 in the proof of the Associative Memory Lemma. By
Lemma 10.3.2, each step of this simulation can be simulated by ER' in O(log(P(n)T(n)))
steps. □

Theorem 10.3. EREW-PRAM[÷]-TIME(T(n)) ⊆ EREW-PRAM-TIME(T^2(n)).

Proof. An EREW PRAM[÷] can generate numbers only up to n + T(n) bits long; hence, by
Lemma 10.3.1, an EREW PRAM takes O(log^2(n+T(n))) time to compute the quotient of
two such numbers. The EREW PRAM simulates the EREW PRAM[÷] through Lemma
10.3.3, and the memory accesses of Lemma 10.3.3 dominate the computation time of each
step. Hence, by Theorem 5.3 and Lemma 10.3.3, an EREW PRAM can simulate an EREW
PRAM[÷] running in time T(n) in O(T^2(n)) steps. □

Now we simulate an EREW PRAM[↑,↓].

Theorem 10.4. EREW-PRAM[↑,↓]-TIME(T(n)) ⊆ EREW-PRAM-TIME(T^3(n)).

Proof. The theorem is true by Lemmas 6.2.3 and 10.1.1. □

Finally,  we  simulate  EREW  PRAMs  with  probabilistic  choice. 

Theorem 10.5. Let op ∈ {∅, {*}, {*,÷}, {*,↑,↓}}. Let R be a prob-RAM[op] with
time bound T(n) that makes R(n) random choices. There is a deterministic EREW
PRAM[op] ED that simulates R in O(T(n)) time with 2^{R(n)} processors.

Proof.  The  theorem  follows  exactly  by  the  proof  of  Theorem  8.4.  □ 

As  in  Chapter  8,  we  extend  the  simulation  of  a  sequential  machine  to  a  simulation  of  a 
parallel  machine.  Again,  we  will  need  two  proofs:  one  for  PRAMs  with  enhanced 
instruction  sets  and  one  for  basic  PRAMs. 

Theorem 10.6. Let op ∈ {{*}, {*,÷}, {↑,↓}, {*,↑,↓}}. Let EP be an EREW
prob-PRAM[op] with time bound T(n), processor bound P(n), and memory bound S(n) that
makes R(n) random choices. There is a deterministic EREW PRAM[op] ED that simulates
EP in time O(R(n) + T(n) log P(n)) with P(n) 2^{R(n)} processors.

Proof. The proof follows the proof of Theorem 8.5, with a modification made for the
exclusive read and exclusive write restrictions. The modification is in reading: up to
P(n) processors may wish to read at each step, taking O(log P(n)) time by Lemma
10.3.2. This, however, is the same amount of time required for sorting the write requests, so
the time per step remains O(log P(n)). Thus, the overall time for ED to simulate EP is
O(R(n) + T(n) log P(n)) steps. □

Theorem 10.7. Let EP be an EREW prob-PRAM with time bound T(n), processor bound
P(n), and memory bound S(n) that makes R(n) random choices. There is a deterministic
EREW PRAM ED that simulates EP in time O(R(n) + T(n) log P(n)) with P(n) 2^{R(n)}
processors.

Proof.  The  proof  is  that  given  for  Theorem  10.6,  with  the  exceptions  listed  in  the  proof  of 
Theorem  8.6.  □ 

We  also  present  an  EREW  version  of  the  Markov  chain  proof. 

Lemma 10.8.1. Let A and B be z × z integer matrices stored one element per cell in the
shared memory of an EREW PRAM[*] E. E can compute their product C = AB in O(log z)
time.

Proof. In O(log z) steps, E activates z^3 processors, assigning z processors to each element
of matrix C. For each element C(g, h) and all 1 ≤ i ≤ z, the ith of its z processors computes
A(g, i) * B(i, h). Since each element of A and each element of B are read by z processors,
E takes O(log z) time to read the elements. Next, also in O(log z) time, the processors
assigned to each element of C add their products, writing the sum into the cell allocated to
that element. □
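
The structure of this proof can be sketched in Python. The sketch is sequential and models
only the z^3 independent products and the logarithmic-depth pairwise summation; the O(log z)
processor activation and the copying needed so that each input element is read exclusively
are not modeled. The function name erew_matrix_product is hypothetical.

```python
def erew_matrix_product(A, B):
    """Return (C, rounds) with C = A*B for z x z integer matrices."""
    z = len(A)
    # One "processor" per triple (g, h, i) computes a single product.
    products = [[[A[g][i] * B[i][h] for i in range(z)]
                 for h in range(z)] for g in range(z)]

    # Pairwise tree summation: the number of partial sums per entry is
    # halved in each round, so ceil(log2 z) rounds suffice.
    rounds, width = 0, z
    while width > 1:
        half = (width + 1) // 2
        for g in range(z):
            for h in range(z):
                row = products[g][h]
                for i in range(width // 2):
                    # Each cell is read by one adder and written by one adder.
                    row[i] = row[i] + row[half + i]
        width, rounds = half, rounds + 1

    C = [[products[g][h][0] for h in range(z)] for g in range(z)]
    return C, rounds
```

In every summation round each partial sum is read by exactly one processor, which is why the
addition phase respects the exclusive-read and exclusive-write restrictions.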

As a preliminary step, we must describe a version of the Associative Memory Lemma
for EREW prob-PRAMs.

Lemma 10.8.2. (Associative Memory Lemma for EREW prob-PRAMs) Let op ⊆ {*, ÷, ↑,
↓}. For all T(n), P(n), and S(n), every language recognized in time T(n) with P(n)
processors using at most S(n) cells by an EREW prob-PRAM[op] EP can be recognized in
time O(T(n) log(P(n)T(n))) by an EREW prob-PRAM[op] EP' that accesses only cells
with addresses in 0, ..., O(S(n)).

Proof. By Lemma 8.7.1, a CRCW prob-PRAM[op] CP' running in time O(T(n)) with
O(P^2(n)T(n)) processors recognizes every language recognized by EP. By Lemma 9.1.1,
an EREW prob-PRAM[op] EP' simulates each step of CP' in time O(log(P(n)T(n))). □

Theorem 10.8. Let EP be an EREW prob-PRAM[*,÷] with time bound T(n), processor
bound P(n), memory bound S(n), integer bound I(n), and program length k. Then there is a
deterministic EREW PRAM[*,÷] ED that simulates EP in time
O((P(n) + log I(n)) · S(n) · log T(n)) with O((k P(n) I(n))^{3S(n)}) processors.

Proof (sketch). Let EP' simulate EP according to Lemma 10.8.2. Then EP' has time bound
O(T(n) log(P(n)T(n))), processor bound O(P(n)S(n)), memory bound S(n), and integer
bound I(n). By Lemma 10.8.1, an EREW PRAM[*,÷] can simulate EP' according to the
simulation described in the proof of Theorem 8.7 in the same time. Since P(n) ≤ 2^{T(n)},
O(log(T(n) log(P(n)T(n)))) = O(log T(n)). □

•  Input  convention 

PRAM definitions sometimes differ in the input convention. The input in our model is
a single integer n bits long in c(0). We have the special instruction r(i) ← BIT(j), which
places the rcon(j)th bit of the input in r(i). Two other input styles are used: (1) the input
consists of n bits, one each in c(0), c(1), ..., c(n−1); (2) the input consists of r integers, one
each in c(0), c(1), ..., c(r−1), and the sum of the lengths of these integers is n. In O(log n)
time, our PRAM can convert its input to either of the other styles by activating one processor
for each bit position, then using the BIT instruction to read individual bits of the input. Note
that without the BIT instruction, the conversion would take O(n) time on the basic PRAM
because the PRAM must build bit masks n bits long to read individual bits of the input.
Similarly, the basic PRAM takes O(n) time to convert from either of the other styles to our
input style. A PRAM[*] or PRAM[↑,↓] can convert the input from one style to another in
O(log n) time.

•  Local  memory 

We allow each processor infinite local memory. This definition was convenient, but not
necessary, since a PRAM in which each processor has a constant number of local registers
can simulate our PRAM with only a constant factor increase in time. We present two
theorems to establish this fact, one for basic PRAMs and one for PRAMs with enhanced
instruction sets.

Theorem  10.9.  Let  Z  be  a  PRAM  running  in  T  (n)  time  with  P  (n)  processors  in  which  each 
processor  has  infinite  local  memory.  A  PRAM  R  in  which  each  processor  has  only  4  local 
registers  can  simulate  Z  in  O  (T ( n ))  time  with  P  (n)  processors. 

Proof. We allow R three separate shared memories: mem_1, mem_2, and mem_3. R uses
mem_1 to simulate the shared memory of Z, mem_2 to simulate the local memories of Z, and
mem_3 to store an address table. In mem_2, R sets aside a block of O(2^{T(n)}) cells for each
processor of R to use as the local memory of the corresponding processor of Z. For
simplicity, assume the size of each block is 2^{T(n)}. R will access c_2(m 2^{T(n)} + k) for every
access to r_m(k) of Z.

R activates P(n) processors in O(log P(n)) time. These processors make an address
table in mem_3, storing m 2^{T(n)} in cell m, for 1 ≤ m ≤ P(n). This takes O(T(n) + log P(n))
time. R uses this address table to speed access to the blocks of cells in mem_2.

Let P_m be the processor of R that corresponds to processor P_g of Z. When P_m is
activated, it computes g and writes g in r_m(0).

In a general step of Z, suppose processor P_g executes r(i) ← r(j) ○ r(k). To simulate
the read of the contents of r_g(j), the corresponding processor P_m of R writes j in r_m(2),
copies con_3(g) into r_m(1), and adds j to rcon_m(1). P_m then accesses mem_2(g 2^{T(n)} + j)
indirectly through r_m(1). P_m performs the same actions to read the contents of r_g(k), except
using registers 2 and 3 instead of 1 and 2. Now P_m performs r(3) ← r(1) ○ r(2). R
simulates the write in r_g(i) just as described above.

R uses O(T(n) + log P(n)) = O(T(n)) initialization time and constant time to simulate
each step of Z. Thus, R simulates Z in O(T(n)) time with P(n) processors. Each processor
P_m of R uses only four registers r_m(0), ..., r_m(3). □
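
The block layout used in this proof can be illustrated with a short Python sketch. It is a
sequential model under stated assumptions: a dictionary stands in for mem_2, unwritten cells
are assumed to hold 0, and the function names build_address_table, local_address, and
simulate_step are hypothetical. The table entry m 2^{T(n)} is produced by T(n) doublings, so
only addition is needed, in line with the basic instruction set.

```python
def build_address_table(P, T):
    """Return the table m -> m * 2**T for 1 <= m <= P, using addition only."""
    table = {}
    for m in range(1, P + 1):       # conceptually, one processor per m
        base = m
        for _ in range(T):          # T doublings, each a single addition
            base = base + base
        table[m] = base
    return table


def local_address(table, m, k):
    """Cell of mem_2 that holds the simulated local register k of processor m."""
    return table[m] + k


def simulate_step(mem2, table, m, i, j, k, op):
    """Simulate r(i) <- r(j) op r(k) for the processor with block index m."""
    x = mem2.get(local_address(table, m, j), 0)
    y = mem2.get(local_address(table, m, k), 0)
    mem2[local_address(table, m, i)] = op(x, y)
```

As long as every register index k stays below 2^{T(n)}, the blocks of different processors
do not overlap, so each simulated step touches only a constant number of cells.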

Theorem 10.10. Let op ⊆ {*, ↑, ↓}. Let Z be a PRAM[op] running in T(n) time with P(n)
processors in which each processor has infinite local memory. A PRAM[op] R in which each
processor has only 4 local registers can simulate Z in O(T(n)) time with P(n) processors.

Proof. We allow R two separate shared memories: mem_1 and mem_2. R uses mem_1 to
simulate the shared memory of Z and mem_2 to simulate the local memories of Z. R will
access c_2(kP(n) + m) for r_m(k) of Z. If * ∉ op, then assume without loss of generality that
P(n) is a power of 2; this way R can perform shifts to perform the multiplications specified
below.

R activates processors as specified by the program of Z, so processor P_m of R simulates
processor P_m of Z. P_m of R stores P(n) (or log P(n) if R does not have multiplication) in
r_m(1). In a general step of Z, suppose processor P_m executes r(i) ← r(j) ○ r(k). To
simulate the read of the contents of r(j), the corresponding processor P_m of R writes j in
r_m(2), multiplies rcon_m(2) by P(n), then adds m to the product. (By definition, P_m has m in
r_m(0).) P_m then reads mem_2(jP(n)+m) indirectly through r_m(2), writing con_2(jP(n)+m) in
r_m(2). P_m performs the same actions to read the contents of r(k), except using register 3
instead of 2. Now P_m performs r(3) ← r(1) ○ r(2). R simulates the write in r(i) just as
described above.

R  uses  constant  time  to  simulate  each  step  of  Z.  Thus,  R  simulates  Z  in  O  (T  ( n ))  time 
with  P  ( n )  processors.  □ 
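
The interleaved layout c_2(kP(n) + m) can be checked with a small Python sketch. It assumes
processor indices 0 ≤ m < P(n); the function names are hypothetical. The first function uses
multiplication, as a PRAM[*] would, and the second uses a left shift, as a PRAM[↑,↓] would
when P(n) is a power of 2 and log P(n) is stored instead of P(n).

```python
def interleaved_address(m, k, P):
    """Address of simulated register k of processor m when * is available."""
    return k * P + m


def interleaved_address_shift(m, k, log_P):
    """Same address when only shifts are available and P = 2**log_P."""
    return (k << log_P) + m
```

Either way the address is obtained with a constant number of instructions and no address
table, which is why this simulation needs no initialization phase.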

•  Operations  in  shared  memory 

We  restricted  the  instruction  set  so  that  the  only  operations  permitted  on  shared  memory 
are  indirect  reads  and  writes.  Again,  this  definition  was  convenient,  but  not  necessary,  since 
such  a  PRAM  can  simulate  a  PRAM  allowing  all  operations  in  shared  memory  with  only  a 
small  constant  factor  increase  in  time. 

Theorem 10.11. Let Z be a PRAM running in T(n) time in which all instructions can be
performed in either shared or local memory. A PRAM R allowing only indirect reads and
writes to shared memory can simulate Z in O(T(n)) time.

Proof. For all m, processor P_m of R simulates processor P_m of Z. At time t, suppose P_m of Z
executes c(i) ← c(j) ○ c(k). Then P_m of R copies con(j) and con(k) into its local
memory, performs ○, then writes con(j) ○ con(k) in c(i). Thus, R simulates each step of
Z in constant time. □

•  Processor  activation 

A common method of processor activation is to assume that all P(n) processors are
initially active. Bounds for arbitrary P(n) with this method of processor activation can be
derived from our simulations because we presented all simulations for arbitrary P(n), then
fixed P(n) ≤ 2^{T(n)} in deriving the final bounds.

The final area of alternative definitions is the FORK operation. Recall that we defined
the instruction FORK label1, label2 as executed by P_g to cause P_g to halt and activate P_{2g}
and P_{2g+1}, setting their program counters to label1 and label2, respectively. Fortune and
Wyllie (1978) defined FORK label as executed by P_g to activate the lowest numbered
inactive processor P_m, clear the local memory of P_m, copy the contents of the accumulator of
P_g into the accumulator of P_m, and set the program counter of P_m to label. Let us call this
operation FW-FORK. We now present a lemma to establish that our simulations all work
with the same bounds with FW-FORK in the place of FORK.

The main difference between FORK and FW-FORK is in the processor number of the
activated processor(s). In our simulation, the processor number is important in establishing
the relationship between a primary and its secondary processors. Recall the Activation
Lemma (Chapter 3): for a primary processor P_g with σ secondary processors, the secondary
processors are numbered k + g, for k = 1, ..., σ. Further, each secondary processor P_{g+k}
computes k in order to assign itself to an item indexed by k in the computation. With
FW-FORK, the relationship between processor numbers of primary and secondary
processors is different: with π primary processors and σ secondary processors, the secondary
processors belonging to primary processor P_g are numbered kπ + g, for all k = 1, ..., σ. Once
again, each secondary processor P_{kπ+g} must compute k in order to assign itself to an item
indexed by k in the computation. These secondary processors activated by FW-FORK
cannot determine k as easily as the processors activated by FORK, especially in a PRAM[*]
without division. The next lemma describes a method by which the processors quickly
determine k.

Ordering Lemma. Let π and g be fixed positive integers, 0 ≤ g ≤ π−1, and let σ be another
integer. Let Γ denote the set of processors {P_m | m = kπ + g, 1 ≤ k ≤ σ}. Each processor
P_m in Γ can determine k in O(log σ) time.

Proof. Assume π, σ, and g are known. In time O(log σ), one processor builds Table 1 in
mem_1 such that location h contains the value 2^h, 0 ≤ h ≤ ⌈log σ⌉. Also in time O(log σ),
another processor builds Table 2 in mem_2 such that location h contains the value π2^h,
0 ≤ h ≤ ⌈log σ⌉. In the following, the values 2^h and π2^h are read from Table 1 and Table 2,
respectively. Each processor P_m, m = kπ + g, determines k as follows.

1. α := 1, β := 1. (α, β are indices into the tables.)

2. Compare π2^α with m. If π2^α > m, then k := 1.

3. α := α + 1.

4. Compare π2^α with m. If π2^α < m, then go to Step 3. If π2^α > m, then
2^α/2 ≤ k < 2^α. (P_m will determine the value of k within this range by binary
search.)

5. lower := π2^{α−1}, upper := π2^α, and k.bound := 2^{α−1}.

6. β := β + 1.

7. middle := lower + π2^{α−β}, k.bound := k.bound + 2^{α−β}.

8. If middle < m − g, then lower := middle; go to Step 6.

If middle = m − g, then k := k.bound. Done.

If middle > m − g, then upper := middle and k.bound := k.bound − 2^{α−β}; go to
Step 6.

P_m performs each step in the above algorithm in constant time. A processor may iterate
Steps 6-8 or Steps 3 and 4 up to O(log σ) times. The processors build Tables 1 and 2 in
O(log σ) time. Thus, each processor in Γ can determine k in O(log σ) time. Note that the
algorithm uses only addition and subtraction. □
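
The search in this proof can be sketched in Python. This is a variant written under stated
assumptions rather than a line-by-line transcription of Steps 1-8: it searches for m − g
throughout and uses ≤ comparisons so that values of k that are exact powers of 2 are also
covered, and the names build_tables and determine_k are hypothetical. Like the algorithm
above, it uses only addition, subtraction, comparison, and table lookups.

```python
def build_tables(pi, sigma):
    """Table 1 holds 2**h and Table 2 holds pi * 2**h, both built by doubling."""
    table1, table2 = [1], [pi]
    while table1[-1] <= sigma:              # stop once 2**h exceeds sigma
        table1.append(table1[-1] + table1[-1])
        table2.append(table2[-1] + table2[-1])
    return table1, table2


def determine_k(m, pi, g, sigma):
    """Recover k from m = k*pi + g (1 <= k <= sigma) without * or division."""
    table1, table2 = build_tables(pi, sigma)
    target = m - g                          # equals k * pi

    # Exponential search: smallest a with pi * 2**a > target,
    # so that 2**(a-1) <= k < 2**a.
    a = 1
    while table2[a] <= target:
        a = a + 1

    # Binary search: settle the remaining bits of k from high to low.
    lower, k_bound = table2[a - 1], table1[a - 1]
    b = a - 2
    while lower != target:
        middle = lower + table2[b]
        if middle <= target:
            lower, k_bound = middle, k_bound + table1[b]
        b = b - 1
    return k_bound
```

Both loops run at most about log σ times, and the tables have about log σ entries, which
matches the O(log σ) bound of the lemma.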

Observe that O(log σ) is the same as the time required to activate σ processors, so the
Ordering Lemma implies no more than a constant factor increase in time in a simulation if
FW-FORK replaces FORK.

Chapter 11. Summary and Open Problems

11.1.  Summary 

In this thesis, we compared the computational power of time bounded Parallel Random
Access Machines (PRAMs) with different instruction sets. We proved that polynomial time
on PRAM[*]s or on PRAM[*,÷]s or on PRAM[↑,↓]s is equivalent to polynomial space on a
Turing machine (PSPACE). In particular, we showed the following bounds. Let each
simulated machine run for T(n) steps on inputs of length n; let T denote T(n) in the table
below. The simulating machines are basic PRAM, Turing machine, RAM with the same
instruction set, basic EREW PRAM, and uniform family of bounded fan-in circuits. The
bounds for the simulating machine are expressed in time, space, or depth, as shown in
parentheses by the machine type. The notation EREW means that the simulating machine is
an EREW PRAM, and the simulated machine is an EREW PRAM[op].


Table 11.1. Summary of results

                                     Simulated machine
Simulating machine    PRAM[*]        PRAM[*,÷]      PRAM[÷]        PRAM[↑,↓]
PRAM (time)           T^2 / log T    T^2            T log(n+T)     T^2
TM (space)            T^2            T^2 log T      T^2            T^3
RAM[op] (time)        T^2            T^3            ...            T^3
EREW (time)           T^2            T^2 log T      T^2            T^3
circuit (depth)       T^2            T^2 log T      T^2            T^3

As noted in Section 9.6, the simulation of a PRAM[÷] by a RAM[÷] is highly
inefficient.

Further, we proved that PRAM[*,↑,↓]-PTIME is contained between NEXPTIME and
EXPSPACE. This is notable because polynomial time on a PRAM with either multiplication
or shifts alone is equivalent to PSPACE. Recall the Parallel Computation Thesis:
polynomial time on a reasonable model of parallel computation is equivalent to polynomial
space on a sequential model of computation. Our result on PRAM[*,↑,↓]s does not
contradict the Parallel Computation Thesis because the numbers generated by multiplication
and shift together are too long and complex to be "reasonable."

We  also  presented  simulations  of  probabilistic  PRAMs  by  deterministic  PRAMs,  using 
parallelism  to  replace  randomness. 

11.2.  Open  Problems 

1. As noted in Chapter 1, if we could reduce the number of processors used by the
simulation of a PRAM[*] or PRAM[*,÷] or PRAM[↑,↓] by a PRAM from an exponential
number to a polynomial number, then NC would be the languages accepted by PRAM[*]s,
PRAM[*,÷]s, or PRAM[↑,↓]s, respectively, in polylog time with a polynomial number of
processors. Can the number of processors used by the PRAM in simulating the PRAM[*] be
reduced to a polynomial in P(n)T(n)?

2. We showed NEXPTIME ⊆ PRAM[*,↑]-PTIME ⊆ EXPSPACE (Corollaries 7.1.1
and 7.2.1). Does PRAM[*,↑]-PTIME = EXPSPACE?

3. What is the relationship between RAM[*,↑]s and PRAM[*,↑]s? Is NEXPTIME ⊆
RAM[*,↑]-PTIME?

4. Can a log-space uniform, fan-in 2, O(log n) depth circuit perform division? Beame
et al. (1986) developed a poly-time uniform division circuit. We could improve Theorems
5.1, 5.2, and 5.3 with a log-space uniform, O(log n) depth division circuit.

5. Can the log T(n) factor in PRAM[*,÷]-TIME(T(n)) ⊆ DSPACE(T^2(n) log T(n))
(Theorem 5.2) be removed by some other method?

6.  What  are  the  corresponding  lower  bounds  on  any  of  these  simulations?  Are  any  of 
the  bounds  optimal? 

7.  As  one  of  the  first  results  of  computational  complexity  theory,  the  linear  speed-up 
theorem  for  Turing  machines  (Hartmanis  and  Steams,  1965)  states  that  for  every  multitape 
Turing  machine  of  time  complexity  T(n)x>n  and  every  constant  c  >  0,  there  is  a  multitape 
Turing  machine  that  accepts  the  same  language  in  time  cT  ( n ).  The  linear  speed-up  property 
of  Turing  machines  justifies  the  widespread  use  of  order-of-magnitude  analyses  of 
algorithms.  Do  PRAMs  also  enjoy  the  linear  speed-up  property? 

Appendix A: Procedure BOOL

These appendices are written in guarded command style in order to show parallel cases
more clearly. The general form of a conditional command is
    if B_1 → S_1
    [] B_2 → S_2
    ...
    [] B_n → S_n
    fi

where B_i is a Boolean expression and S_i is a command or sequence of commands. For the
deterministic case, exactly one guard B_i is true. Similarly, the form of a DO loop is
    do B → S
    od

The loop is repeated until guard B is false. The command skip does nothing.
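
The two constructs can be mimicked in ordinary code. The following Python sketch is an illustration only and is not part of the thesis; the helper names guarded_if and guarded_do are hypothetical. It assumes the deterministic case described above, so it raises an error unless exactly one guard is true.

```python
def guarded_if(*clauses):
    """Run the command of the single clause whose guard is true.

    Each clause is a (guard, command) pair of zero-argument callables.
    Assumption: the deterministic case, so exactly one guard must hold.
    """
    enabled = [command for guard, command in clauses if guard()]
    if len(enabled) != 1:
        raise RuntimeError("expected exactly one true guard, found %d" % len(enabled))
    return enabled[0]()


def guarded_do(guard, body):
    """Repeat body until guard is false, mirroring 'do B -> S od'."""
    while guard():
        body()


# Example: count x down to 0, recording the parity of each value seen.
state = {"x": 3, "log": []}


def step():
    guarded_if(
        (lambda: state["x"] % 2 == 0, lambda: state["log"].append("even")),
        (lambda: state["x"] % 2 == 1, lambda: state["log"].append("odd")),
    )
    state["x"] -= 1


guarded_do(lambda: state["x"] > 0, step)
```

Passing guards and commands as callables delays their evaluation until the construct runs, which is what lets the loop guard be re-tested on every iteration.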

% The procedure BOOL(j, ψ1, k, ψ2, i, ψ3) takes as input E_i(g), containing the list of
% subtrees formed by merging the first levels of E(rcon_g(j)) and E(rcon_g(k)).

% BOOL returns the list with each subtree labeled as interesting or boring.
% For each node α, where α is the root of a subtree in the list, assume proc(α) is
% a secondary processor belonging to primary processor Pm,
% which corresponds to processor Pg of S'.
% For each α, proc(α) executes the steps specified below.
% The variables j_int, k_int, nextj_int, nextk_int, right(α), and num(α) are
% global variables.
% j_int tells if rcon_g(j) is in an interval of 0's or 1's at the position specified by val(α).
% k_int is defined similarly.


if right(α) = 0 →
    % That is, if we want to know whether the least significant bit of rcon_g(i) is 0 or 1.
    if j_int = 1 or k_int = 1 → result := 1
    [] j_int = 0 and k_int = 0 → result := 0
    fi

[] right(α) ≠ 0 →
    % Otherwise, we want to know whether val(α) is the position of an interesting bit
    % of rcon_g(i).
    if val(α) = val(node(num(α)+1)) → result := boring
    % result = boring means that val(α) is not the location of an interesting bit in rcon_g(i);
    % result = interesting means that val(α) is the location of an interesting bit in rcon_g(i).
    % proc(α) tests whether or not val(α) = val(node(num(α)+1)) by calling COMPARE.
    [] val(α) ≠ val(node(num(α)+1)) →
        % nextj_int is j_int at val(node(num(α)+1)); nextk_int is defined similarly.
        if j_int = 1 or k_int = 1 →
            if nextj_int = 1 or nextk_int = 1 → result := boring
            [] nextj_int = 0 and nextk_int = 0 → result := interesting
            fi
        [] j_int = 0 and k_int = 0 →
            if nextj_int = 1 or nextk_int = 1 → result := interesting
            [] nextj_int = 0 and nextk_int = 0 → result := boring
            fi
        fi
    fi
fi
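
The case analysis in BOOL reduces to comparing the OR of the two operands in the current interval with the OR in the next interval. The Python sketch below restates that decision for a single node; it is an illustration, not the thesis procedure. The function name bool_label and the flag same_position_as_next (standing in for the COMPARE test val(α) = val(node(num(α)+1))) are hypothetical.

```python
def bool_label(j_int, k_int, nextj_int, nextk_int, right, same_position_as_next):
    """Label one node of the merged list for r(i) := r(j) OR r(k).

    j_int, k_int          -- bit value (0 or 1) of each operand in the current interval
    nextj_int, nextk_int  -- bit values in the interval at the next node
    right                 -- index of the node, counting from the right (0 = least significant)
    same_position_as_next -- hypothetical flag: True when this node and the next
                             node denote the same bit position
    """
    here = j_int | k_int
    if right == 0:
        # Only the least significant bit of the result is wanted.
        return here
    if same_position_as_next:
        # A duplicated position is reported once, by the later node.
        return "boring"
    there = nextj_int | nextk_int
    # The position is interesting exactly when the OR changes value across it.
    return "interesting" if here != there else "boring"
```

For example, bool_label(1, 0, 0, 0, 1, False) yields "interesting", because the OR drops from 1 to 0 at that position.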

Appendix B: Procedure ADD (PRAM)

% The procedure ADD(j, ψ1, k, ψ2, i, ψ3) takes as input E_i(g), containing
% the list of subtrees formed by merging the first levels of E(rcon_g(j))
% and E(rcon_g(k)).

% ADD returns the list with each subtree labeled as interesting or boring,
% and the value of each interesting subtree specifies the location of an
% interesting bit in rcon_g(i) = rcon_g(j) + rcon_g(k).
% For each node α, where α is the root of a subtree in the list, assume proc(α) is
% a secondary processor belonging to primary processor Pm,
% which corresponds to processor Pg of S'.
% For each α, proc(α) executes the steps specified below.
% The variables i_int, j_int, k_int, nextj_int, nextk_int, carryout, pair_length,
% right(α), and num(α) are global variables.
% j_int tells if rcon_g(j) is in an interval of 0's or 1's at the position specified
% by val(α); k_int and i_int are defined similarly.
% pair_length tells if the interval-pair length of the interval-pair ending at
% position val(α) is one or more.
% carryin tells whether there is a carry into an interval-pair.


if right(α) = 0 →
    % That is, if we want to know whether the least significant bit of rcon_g(i) is 0 or 1.
    if j_int = k_int → result := 0
    [] j_int ≠ k_int → result := 1
    fi

[] right(α) ≠ 0 →
    % Otherwise, we want to know whether val(α) is the position of an interesting bit
    % of rcon_g(i).
    if val(α) = val(node(num(α)+1)) → result := boring
    % result = boring means that val(α) is not the location of an interesting bit in rcon_g(i);
    % result = interesting means that val(α) is the location of an interesting bit in rcon_g(i).
    [] val(α) ≠ val(node(num(α)+1)) →
        % nextj_int is j_int at val(node(num(α)+1)); nextk_int is defined similarly.
        if (nextj_int = nextk_int = carryout ≠ i_int)
           or ((nextj_int ≠ nextk_int) ∧ (i_int = carryout)) →
            if pair_length = 1 → result := boring
            [] pair_length = more →
                E(rcon_g(i)).ψ3.(right(α)+1)
                    := ADD(j, ψ3.(right(α)+1), #1, λ, i, ψ3.(right(α)+1))
                result := interesting
            fi
            % right(α)+1 specifies which element of the merged list α is, counting from the right

        [] nextj_int = nextk_int = i_int ≠ carryout →
            if pair_length = 1 → result := interesting
            [] pair_length = more →
                result := interesting
                E(rcon_g(i)).ψ3.(right(α)+1½)
                    := ADD(i, ψ3.(right(α)+1), #1, λ, i, ψ3.(right(α)+1))
                    := interesting
                % In this case, an interesting bit occurs in the sum at val(α) and val(α)+1.
                % We say that we insert the second interesting bit into the merged list at
                % location right(α)+1½.
                % In fact, the merging algorithm leaves an empty slot between each pair of
                % consecutive elements so that such interesting bits may be inserted.
            fi

        [] ((nextj_int ≠ nextk_int) ∧ (i_int ≠ carryout))
           or (nextj_int = nextk_int = i_int = carryout) → skip
        fi
    fi
fi
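
One way to see why pair_length matters: within an interval-pair both operands are constant, so the sum bits follow a closed-form rule. The Python sketch below is an illustration under that reading, not the thesis procedure; the function names are hypothetical. If the operand bits differ, every sum bit is the complement of the incoming carry and the carry passes through unchanged. If they agree, the lowest sum bit equals the incoming carry and every later bit equals the operand bit, which produces an isolated extra bit when the pair is longer than one and the carry differs from the operands.

```python
def add_run_pair(a, b, carry_in, length):
    """Closed-form sum of two constant runs (bits a and b) of the given length.

    Returns (sum_bits, carry_out) with sum_bits listed from least significant upward.
    """
    if a != b:
        # Exactly one operand bit is 1: each sum bit is NOT carry, carry propagates.
        return [1 - carry_in] * length, carry_in
    # Operand bits agree: first sum bit is the carry, the rest equal the operand bit.
    return [carry_in] + [a] * (length - 1), a


def add_run_pair_reference(a, b, carry_in, length):
    """Bit-by-bit full adder, used only to check the closed form."""
    bits, carry = [], carry_in
    for _ in range(length):
        total = a + b + carry
        bits.append(total & 1)
        carry = total >> 1
    return bits, carry
```

For instance, add_run_pair(1, 1, 0, 3) gives ([0, 1, 1], 1): the lowest bit differs from the two above it, so the sum gains a run boundary inside the pair.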

Appendix C: Procedure COMPARE

% Assume α has the form j, β has the form k.
% Assume rcon(j) and rcon(k) are positive.
% COMPARE recursively compares subtrees of I(rcon(j)).ψ1 and I(rcon(k)).ψ2
% from right to left.
% Let I(α') = I(rcon(j)); I(β') = I(rcon(k)).
% Return "greater" if val(I(α').ψ1) > val(I(β').ψ2),
% "equal" if val(I(α').ψ1) = val(I(β').ψ2), or
% "less" if val(I(α').ψ1) < val(I(β').ψ2) at time t.

δα := SYMBOL(α, ψ1, t);
δβ := SYMBOL(β, ψ2, t);

if δα = subtree and δβ ≠ subtree → result := greater
[] δα ≠ subtree and δβ = subtree → result := less
[] δα = 1 and δβ = 0 → result := greater
[] δα = 0 and δβ = 1 → result := less
[] δα = 0 and δβ = 0 → result := equal
[] δα = 1 and δβ = 1 → result := equal
[] δα = subtree and δβ = subtree →
    A := SYMBOL(α, ψ1.1, t);
    B := SYMBOL(β, ψ2.1, t);
    % A tells whether we are currently looking at a run of 0's or 1's in rcon(j);
    % similarly for B. Their initial values tell if rcon(j) and rcon(k) start in a
    % run of 0's or 1's.
    if A > B → result := greater
    [] A = B → result := equal
    [] A < B → result := less
    fi;

    aψ := 2; bψ := 2;
    % aψ and bψ are pointers into their respective encodings
    do neither aψ nor bψ reaches beyond encoding →
        X := COMPARE(α, ψ1.aψ, β, ψ2.bψ, t);
        % If A = B, then we do not update result.
        % If A ≠ B, then we update result to indicate which is greater.
        % In both cases, we advance the pointers and update A and B as needed.
        if X = greater → bψ := bψ + 1; B := ¬B
        [] X = equal → aψ := aψ + 1; bψ := bψ + 1; A := ¬A; B := ¬B
        [] X = less → aψ := aψ + 1; A := ¬A
        fi;

        if A = 0 and B = 1 → result := less
        [] A = 1 and B = 0 → result := greater
        [] A = B → skip
        fi
    od

    % at this point, either aψ or bψ points past its encoding
    if only aψ points past encoding → result := less
    % that is, |rcon(j)| < |rcon(k)|, so rcon(j) < rcon(k)
    [] only bψ points past encoding → result := greater
    % if both point past encodings, then we must test which is greater
    [] both aψ and bψ point past their encodings →
        X := COMPARE(α.ψ1, aψ-1, β.ψ2, bψ-1, t);
        if X = greater or X = less → result := X
        [] X = equal → skip
        fi
        % otherwise, result stays the same
    fi
fi

Recall that α and β can have five different forms. These are: j, j.φ, #d, 1+j.φ, and
the complemented form of 1+j.φ. The above algorithm considers only the first form. For α
of the form j.φ, call COMPARE(j, φ.ψ1, k, ψ2, t). For α of the form #d, call CONVERT(#d)
to convert the constant d to the interesting bit encoding. For α of the form 1+j.φ, call
COMPARE(ADD(j.φ, #1, ψ1, t), λ, k, ψ2, t). For α of the complemented form of 1+j.φ,
handle α the same as in the previous case, except interpret 0's as 1's and 1's as 0's.
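
The right-to-left scan in COMPARE can be illustrated on a flattened encoding. The Python sketch below is a simplification, not the thesis procedure: it assumes each integer is given as a starting bit plus a flat sorted list of run-boundary positions, whereas the thesis stores those positions recursively as subtrees and compares them with recursive COMPARE calls. The names encode and compare_encoded are hypothetical.

```python
def encode(x):
    """Assumed flat encoding of a nonnegative integer.

    Returns (start_bit, ends): start_bit is bit 0 of x, and ends lists every
    position p whose bit differs from bit p+1 (leading bits are all 0).
    """
    ends, p = [], 0
    while (x >> p) != 0:
        if ((x >> p) & 1) != ((x >> (p + 1)) & 1):
            ends.append(p)
        p += 1
    return x & 1, ends


def compare_encoded(a, b):
    """Compare two encoded integers, scanning runs from the least significant end."""
    (a_bit, a_ends), (b_bit, b_ends) = a, b
    ai = bi = 0
    result = "equal"
    while ai < len(a_ends) and bi < len(b_ends):
        # Within the current segment the operands are constant; a difference in a
        # more significant segment overrides any earlier verdict.
        if a_bit != b_bit:
            result = "greater" if a_bit > b_bit else "less"
        if a_ends[ai] < b_ends[bi]:
            ai += 1; a_bit ^= 1          # a's run stops first
        elif a_ends[ai] > b_ends[bi]:
            bi += 1; b_bit ^= 1          # b's run stops first
        else:
            ai += 1; bi += 1; a_bit ^= 1; b_bit ^= 1
    if ai < len(a_ends):
        return "greater"                 # a still has a 1 above all of b
    if bi < len(b_ends):
        return "less"                    # b still has a 1 above all of a
    return result
```

Under this encoding, encode(6) is (0, [0, 2]) and compare_encoded(encode(5), encode(6)) returns "less".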

Appendix D: Procedure SYMBOL, Boolean Case

% Assume γ has the form i.
% Suppose we have found that processor Pn executed instruction instr at time t-1
% that wrote r(i). instr was r(i) ← r(j) ∨ r(k).
% If γ.ψ points to a leaf in I(γ) at time t, then return the symbol (0 or 1),
% otherwise, return an indication that γ.ψ points to a subtree.

if ψ = λ →
    % λ represents the empty pointer
    j_run := SYMBOL(j, λ, t-1);
    k_run := SYMBOL(k, λ, t-1);
    if j_run = subtree or k_run = subtree → result := subtree
    [] j_run = 1 or k_run = 1 → result := 1
    [] j_run = 0 and k_run = 0 → result := 0
    fi

[] ψ ≠ λ →
    j_run := SYMBOL(j, 1, t-1);
    k_run := SYMBOL(k, 1, t-1);
    % j_run tells if we are currently looking at a run of 0's or 1's in rcon(j),
    % similarly for k_run
    if ψ = 1 →
        % that is, if we want to know whether the least significant bit of γ is 0 or 1
        if j_run = 1 or k_run = 1 → result := 1
        [] j_run = 0 and k_run = 0 → result := 0
        fi

    [] ψ ≠ 1 →
        % else we want to know the location of some interesting bit of γ
        jψ := 2; kψ := 2; γψ := 2;
        % jψ, kψ, and γψ are pointers to show where we are currently
        % looking in the respective encodings
        result := λ;
        do result = λ →
            oldj_run := j_run;
            oldk_run := k_run;

            runstop := COMPARE(j, jψ, k, kψ, t-1);
            % runstop indicates which run of identical bits stops first
            if runstop = less → string := j
            % string tells the string, j, k, or both, whose run of bits stops first
            [] runstop = equal → string := both
            [] runstop = greater → string := k
            fi;

            if string = j → j_run := ¬j_run; jψ := jψ + 1
            [] string = both →
                j_run := ¬j_run; k_run := ¬k_run;
                jψ := jψ + 1; kψ := kψ + 1
            [] string = k → k_run := ¬k_run; kψ := kψ + 1
            fi;

            if oldj_run = 0 and oldk_run = 0 →
                if FIRST(ψ) = γψ →
                    if string = both or string = j →
                        result := SYMBOL(j.jψ, REST(ψ), t-1)
                    [] string = k → result := SYMBOL(k.kψ, REST(ψ), t-1)
                    fi
                [] FIRST(ψ) ≠ γψ → γψ := γψ + 1
                fi

            [] oldj_run = 1 or oldk_run = 1 →
                if j_run = 0 and k_run = 0 →
                    % then an interesting bit occurs at the end of the 1's
                    if FIRST(ψ) = γψ →
                        if string = both or string = j
                            → result := SYMBOL(j.(jψ-1), REST(ψ), t-1)
                        [] string = k
                            → result := SYMBOL(k.(kψ-1), REST(ψ), t-1)
                        fi
                    [] FIRST(ψ) ≠ γψ → γψ := γψ + 1
                    fi
                [] j_run = 1 or k_run = 1 → skip
                fi
            fi
        od
    fi
fi
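
The run-merging idea behind the Boolean case can be checked on a flattened encoding. The Python sketch below is a simplification, not the thesis procedure: it assumes each integer is a starting bit plus a flat sorted list of run-boundary positions, while the thesis keeps positions as recursively encoded subtrees and answers one pointer query at a time. The names encode and or_encoded are hypothetical. The sketch advances through both boundary lists in order and emits a boundary for the OR exactly where the OR changes value.

```python
INF = float("inf")


def encode(x):
    """Assumed flat encoding: (bit 0 of x, positions p whose bit differs from bit p+1)."""
    ends, p = [], 0
    while (x >> p) != 0:
        if ((x >> p) & 1) != ((x >> (p + 1)) & 1):
            ends.append(p)
        p += 1
    return x & 1, ends


def or_encoded(a, b):
    """Encoding of (x OR y) computed directly from the encodings of x and y."""
    (j_run, j_ends), (k_run, k_ends) = a, b
    ji = ki = 0
    current = j_run | k_run              # value of the OR in the current segment
    start, ends = current, []
    while ji < len(j_ends) or ki < len(k_ends):
        pj = j_ends[ji] if ji < len(j_ends) else INF
        pk = k_ends[ki] if ki < len(k_ends) else INF
        pos = min(pj, pk)                # the run that stops first
        if pj == pos:
            ji += 1; j_run ^= 1
        if pk == pos:
            ki += 1; k_run ^= 1
        following = j_run | k_run
        if following != current:
            # The OR changes across pos, so pos is a boundary of the result.
            ends.append(pos)
            current = following
    return start, ends
```

For example, or_encoded(encode(5), encode(2)) returns (1, [2]), which is the encoding of 7.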



Appendix E: Procedure ADD (TM)

% Assume α has the form j, β has the form k.
% γ ← α + β, so γ ← r(j) + r(k)
% Return result := I(γ).ψ.

if ψ = λ →
    j_run := SYMBOL(j, λ, t-1);
    k_run := SYMBOL(k, λ, t-1);
    % j_run tells whether we are looking at a run of 0's or 1's
    % in rcon(j); similarly for k_run and γ_run
    if j_run = 1 and k_run = 1 → result := subtree
    [] j_run = 0 or k_run = 0 → result := j_run + k_run
    fi

[] ψ = 1 →
    j_run := SYMBOL(j, 1, t-1);
    k_run := SYMBOL(k, 1, t-1);
    if j_run = k_run → result := 0
    [] j_run ≠ k_run → result := 1
    fi

[] ψ ≠ λ and ψ ≠ 1 →
    j_run := SYMBOL(j, 1, t-1);
    k_run := SYMBOL(k, 1, t-1);
    if j_run = k_run → γ_run := 0
    [] j_run ≠ k_run → γ_run := 1
    fi

    carryin := 0; jψ := 2; kψ := 2; γψ := 2;
    % carryin tells whether there is a carry into a run-pair,
    % jψ, kψ, γψ are pointers into their respective encodings
    finished := false;
    % finished is used as a flag for exiting the do statement
    do ¬finished →
        right_pointer := COMPARE(j, jψ, k, kψ, t-1);
        % tells which pointer is to the right, that is, end of current run-pair
        old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
        % old_left indicates end of previous run-pair

        if (right_pointer = less or right_pointer = equal)
           and (old_left = less or old_left = equal) →
            % right_pointer = less or equal means val(j.jψ) ≤ val(k.kψ)
            if COMPARE(j, jψ, 1+k.(kψ-1), t-1) = equal → pair_length := 1
            % pair_length tells if the run-pair is of length one or more
            [] COMPARE(j, jψ, 1+k.(kψ-1), t-1) ≠ equal → pair_length := more
            fi

        [] (right_pointer = less or right_pointer = equal) and old_left = greater →
            if COMPARE(j, jψ, 1+j.(jψ-1), t-1) = equal → pair_length := 1
            [] COMPARE(j, jψ, 1+j.(jψ-1), t-1) ≠ equal → pair_length := more
            fi

        [] right_pointer = greater and (old_left = less or old_left = equal) →
            if COMPARE(k, kψ, 1+k.(kψ-1), t-1) = equal → pair_length := 1
            [] COMPARE(k, kψ, 1+k.(kψ-1), t-1) ≠ equal → pair_length := more
            fi

        [] right_pointer = greater and old_left = greater →
            if COMPARE(k, kψ, 1+j.(jψ-1), t-1) = equal → pair_length := 1
            [] COMPARE(k, kψ, 1+j.(jψ-1), t-1) ≠ equal → pair_length := more
            fi
        fi;


        if γψ = FIRST(ψ) →
            % then this is the desired subtree
            if γψ = ψ → ξ := λ
            % then this is the desired answer
            [] γψ ≠ ψ → ξ := REST(ψ)
            % ξ is a pointer
            fi;
            finished := true;
        [] γψ ≠ FIRST(ψ) → skip
        fi

        if (j_run = k_run = carryin ≠ γ_run)
           or ((j_run ≠ k_run) ∧ (γ_run = carryin)) →
            if γψ = FIRST(ψ) →
                old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                if old_left = greater or old_left = both →
                    result := SYMBOL(j, (jψ-1).ξ, t-1)
                [] old_left = less → result := SYMBOL(k, (kψ-1).ξ, t-1)
                fi
            [] γψ ≠ FIRST(ψ) → γψ := γψ + 1; γ_run := ¬γ_run
            fi

        [] j_run = k_run ≠ γ_run = carryin →
            if pair_length = 1 →
                carryin := ¬carryin;
                if γ_run = 1 → finished := false
                % set finished to false in case it was earlier set true; no interesting bit
                % occurs in this case
                [] γ_run = 0 → skip
                fi

            [] pair_length = more →
                if γψ = FIRST(ψ) →
                    old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                    if old_left = greater or old_left = both
                        → result := ADD(j.(jψ-1), #1, ξ, t-1)
                    [] old_left = less → result := ADD(k.(kψ-1), #1, ξ, t-1)
                    fi
                [] γψ ≠ FIRST(ψ) → γψ := γψ + 1; carryin := ¬carryin;
                    γ_run := ¬γ_run
                fi
            fi

        [] j_run = k_run = γ_run ≠ carryin →
            if pair_length = 1 →
                if γψ = FIRST(ψ) →
                    old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                    if old_left = greater or old_left = both
                        → result := SYMBOL(j, (jψ-1).ξ, t-1)
                    [] old_left = less → result := SYMBOL(k, (kψ-1).ξ, t-1)
                    fi
                [] γψ ≠ FIRST(ψ) → γψ := γψ + 1; carryin := ¬carryin;
                    γ_run := ¬γ_run
                fi

            [] pair_length = more →
                if γψ = FIRST(ψ) →
                    old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                    if old_left = greater or old_left = both
                        → result := SYMBOL(j, (jψ-1).ξ, t-1)
                    [] old_left = less → result := SYMBOL(k, (kψ-1).ξ, t-1)
                    fi

                    % pick up the isolated interesting bit
                    γψ := γψ + 1;
                    if γψ = FIRST(ψ) →
                        if γψ = ψ → ξ := λ
                        [] γψ ≠ ψ → ξ := REST(ψ)
                        fi;
                        finished := true;
                        % to get out of do loop
                        old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                        if old_left = greater or old_left = both
                            → result := ADD(j.(jψ-1), #1, ξ, t-1)
                        [] old_left = less → result := ADD(k.(kψ-1), #1, ξ, t-1)
                        fi
                    [] γψ ≠ FIRST(ψ) → γψ := γψ + 1; carryin := ¬carryin
                    fi
                fi
            fi

        [] ((j_run ≠ k_run) ∧ (γ_run ≠ carryin))
           or (j_run = k_run = γ_run = carryin) → skip
        fi;


        % now shift jψ and kψ as necessary to point to the next interesting bit in
        % their respective encodings
        if right_pointer = less → jψ := jψ + 1; j_run := ¬j_run
        % that is, if the run of identical bits in rcon(j) stops before the
        % run of identical bits in rcon(k)
        [] right_pointer = equal →
            jψ := jψ + 1; kψ := kψ + 1;
            j_run := ¬j_run; k_run := ¬k_run
        [] right_pointer = greater → kψ := kψ + 1; k_run := ¬k_run
        fi;

        % now handle the instance when at least one of jψ, kψ points beyond its encoding
        if ¬finished →
            if jψ and kψ point beyond their encodings →
                % if both point beyond their encodings, we must check if a carry out
                % from the last run-pair causes one more interesting bit to occur in I(γ)
                if carryin = 0 →
                    result := beyond; finished := true
                    % beyond means that I(γ) has fewer than FIRST(ψ)-1 interesting bits
                [] carryin = 1 →
                    % carryin = 1 and rcon(γ) has a 1, which is an interesting bit,
                    % in the location beyond the interesting bits of rcon(j) and rcon(k)
                    if γψ ≠ FIRST(ψ) → result := beyond; finished := true
                    [] γψ = FIRST(ψ) →
                        % this is the bit we are looking for
                        old_left := COMPARE(j, (jψ-1), k, (kψ-1), t-1);
                        if old_left = greater or old_left = both
                            → result := ADD(j.(jψ-1), #1, ξ, t-1)
                        [] old_left = less → result := ADD(k.(kψ-1), #1, ξ, t-1)
                        fi
                    fi
                fi

            [] kψ does not point beyond its encoding →
                result := SYMBOL(k, (FIRST(ψ) - γψ + kψ - 2).REST(ψ), t-1);
                % FIRST(ψ) - γψ - 1 gives the remaining interesting bits to pass
                % over in I(γ); kψ - 1 gives the interesting bits in
                % I(rcon(k)) already accounted for
                finished := true

            [] jψ does not point beyond its encoding →
                result := SYMBOL(j, (FIRST(ψ) - γψ + jψ - 1).REST(ψ), t-1);
                finished := true
            fi

        [] finished → skip
        fi
    od
fi
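
The run-pair scan with a carry can be exercised end to end on a flattened encoding. The Python sketch below is a simplification, not the thesis procedure: it assumes each integer is a starting bit plus a flat sorted list of run-boundary positions, and it builds the whole encoding of the sum, whereas ADD above answers a single pointer query ψ over recursively encoded positions. The names encode and add_encoded are hypothetical. Each run-pair contributes either one constant run (operand bits differ) or the incoming carry followed by a run of the operand bit (operand bits agree), and a final carry may add one more bit beyond both operands.

```python
INF = float("inf")


def encode(x):
    """Assumed flat encoding: (bit 0 of x, positions p whose bit differs from bit p+1)."""
    ends, p = [], 0
    while (x >> p) != 0:
        if ((x >> p) & 1) != ((x >> (p + 1)) & 1):
            ends.append(p)
        p += 1
    return x & 1, ends


def add_encoded(a, b):
    """Encoding of x + y computed run-pair by run-pair from the encodings of x and y."""
    (j_run, j_ends), (k_run, k_ends) = a, b
    ji = ki = 0
    carry, low = 0, 0            # carry into the current run-pair; its lowest position
    runs = []                    # output runs as (bit, length), least significant first
    while ji < len(j_ends) or ki < len(k_ends):
        pj = j_ends[ji] if ji < len(j_ends) else INF
        pk = k_ends[ki] if ki < len(k_ends) else INF
        high = min(pj, pk)       # last position of the current run-pair
        length = high - low + 1
        if j_run != k_run:
            runs.append((1 - carry, length))      # carry passes through unchanged
        else:
            runs.append((carry, 1))               # lowest bit of the pair is the carry
            if length > 1:
                runs.append((j_run, length - 1))  # remaining bits equal the operand bit
            carry = j_run
        if pj == high:
            ji += 1; j_run ^= 1
        if pk == high:
            ki += 1; k_run ^= 1
        low = high + 1
    runs.append((carry, 1))      # both operands are 0 from here on; only the carry remains
    # Convert the output runs back into (start_bit, boundary positions).
    ends, pos = [], -1
    for index, (bit, length) in enumerate(runs):
        pos += length
        following = runs[index + 1][0] if index + 1 < len(runs) else 0
        if bit != following:
            ends.append(pos)
    return runs[0][0], ends
```

For example, add_encoded(encode(3), encode(1)) returns (0, [1, 2]), the encoding of 4.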